NETWORK ANOMALY DETECTION

BHAVESHKUMAR THAKER

BUSINESS PROBLEM

To build network intrusion detection system to detect anomalies and attacks in the Network. There are two problems.

  1. Binomial Classification: Activity is normal or attack
  2. Multinomial classification: Activity is normal or DOS or PROBE or R2L or U2R

Basic features of individual TCP connections

feature name description type
duration length (number of seconds) of the connection continuous
protocol_type type of the protocol, e.g. tcp, udp, etc. discrete
service network service on the destination, e.g., http, telnet, etc. discrete
src_bytes number of data bytes from source to destination continuous
dst_bytes number of data bytes from destination to source continuous
flag normal or error status of the connection discrete
land 1 if connection is from/to the same host/port; 0 otherwise discrete
wrong_fragment number of "wrong" fragments continuous
urgent number of urgent packets continuous

Content features within a connection suggested by domain knowledge

feature name description type
hot number of "hot" indicators continuous
num_failed_logins number of failed login attempts continuous
logged_in 1 if successfully logged in; 0 otherwise discrete
num_compromised number of "compromised" conditions continuous
root_shell 1 if root shell is obtained; 0 otherwise discrete
su_attempted 1 if "su root" command attempted; 0 otherwise discrete
num_root number of "root" accesses continuous
num_file_creations number of file creation operations continuous
num_shells number of shell prompts continuous
num_access_files number of operations on access control files continuous
num_outbound_cmds number of outbound commands in an ftp session continuous
is_hot_login 1 if the login belongs to the "hot" list; 0 otherwise discrete
is_guest_login 1 if the login is a "guest"login; 0 otherwise discrete

Traffic features computed using a two-second time window

feature name description type
count number of connections to the same host as the current connection in the past two seconds continuous
Note: The following features refer to these same-host connections.
serror_rate % of connections that have "SYN" errors continuous
rerror_rate % of connections that have "REJ" errors continuous
same_srv_rate % of connections to the same service continuous
diff_srv_rate % of connections to different services continuous
srv_count number of connections to the same service as the current connection in the past two seconds continuous
Note: The following features refer to these same-service connections.
srv_serror_rate % of connections that have "SYN" errors continuous
srv_rerror_rate % of connections that have "REJ" errors continuous
srv_diff_host_rate % of connections to different hosts continuous

Expectation

  1. User must upload solution in .IPYNB Python/R File with appropriate comment
  2. File Name must start with “EMPID_"
  3. User must upload complete EDA (Exploratory Data Analysis) Document
  4. Model evaluation should be done using AUC

Import and Load Packages

In [1]:
import time
notebookstart = time.time()
In [2]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
In [3]:
import platform
import sys
import importlib
import multiprocessing
import random
In [4]:
import numpy as np
import pandas as pd

random.seed(321)
np.random.seed(321)

pd.options.display.max_columns = 9999
In [5]:
belize_light_flavor = [
    '#5899DA',
    '#E8743B',
    '#19A979',
    '#ED4A7B',
    '#945ECF',
    '#13A4B4',
    '#525DF4',
    '#BF399E',
    '#6C8893',
    '#EE6868',
    '#2F6497',
    ]
In [6]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

%matplotlib inline

mpl.rc('figure', figsize=(15, 12))
plt.figure(figsize=(15, 12))
plt.rcParams['figure.facecolor'] = 'lightcyan'
mpl.style.use('seaborn')
plt.style.use('seaborn')

belize_light_flavor_cmap = mpl.colors.ListedColormap(belize_light_flavor)

from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
<Figure size 1080x864 with 0 Axes>
In [7]:
import seaborn as sns

sns.set(rc={'figure.figsize': (15, 12)})
sns.set(
    context='notebook',
    style='darkgrid',
    font='sans-serif',
    font_scale=1.1,
    rc={'figure.facecolor': 'lightcyan', 'axes.facecolor': 'lightcyan'
        , 'grid.color': 'steelblue'},
    )
sns.color_palette(belize_light_flavor);
In [8]:
# https://anaconda.org/anaconda/plotly
# conda install -c anaconda plotly

plotly_check = importlib.util.find_spec("plotly")
found = plotly_check is not None
if found:
    import plotly
    import plotly.offline as py
    import plotly.graph_objs as go
    import plotly.figure_factory as ff
    from plotly import tools
    from plotly.offline import init_notebook_mode, iplot
    
    init_notebook_mode(connected=True)
else:
    !conda install --yes --prefix {sys.prefix} plotly
    
    import plotly
    import plotly.offline as py
    import plotly.graph_objs as go
    import plotly.figure_factory as ff
    from plotly import tools
    from plotly.offline import init_notebook_mode, iplot
    
    init_notebook_mode(connected=True)
In [9]:
from statsmodels.graphics.mosaicplot import mosaic
In [10]:
# https://anaconda.org/conda-forge/missingno
# conda install -c conda-forge missingno

missingno_check = importlib.util.find_spec("missingno")
found = missingno_check is not None
if found:
    import missingno as msno
else:
    !conda install --yes --prefix {sys.prefix} -c conda-forge missingno
    
    import missingno as msno
In [11]:
# https://anaconda.org/conda-forge/scikit-plot
# conda install -c conda-forge scikit-plot

scikitplot_check = importlib.util.find_spec("scikitplot")
found = scikitplot_check is not None
if found:
    import scikitplot as skplt
else:
    #!conda install --yes --prefix {sys.prefix} -c conda-forge scikit-plot
    !pip install scikit-plot
    
    import scikitplot as skplt
In [12]:
import sklearn

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, GridSearchCV

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler, Normalizer
from sklearn.preprocessing import LabelBinarizer, label_binarize

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import average_precision_score, precision_recall_fscore_support

from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, ClassifierMixin
In [13]:
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
In [14]:
# https://anaconda.org/conda-forge/xgboost
# conda install -c conda-forge xgboost

xgboost_check = importlib.util.find_spec('xgboost')
found = xgboost_check is not None
if found:
    import xgboost as xgb
    from xgboost import XGBClassifier
else:
    !conda install --yes --prefix {sys.prefix} -c conda-forge xgboost
    
    import xgboost as xgb
    from xgboost import XGBClassifier
In [15]:
# https://anaconda.org/conda-forge/catboost
# conda install -c conda-forge catboost

catboost_check = importlib.util.find_spec('catboost')
found = catboost_check is not None
if found:
    import catboost
    from catboost import CatBoostClassifier
else:
    #!conda install --yes --prefix {sys.prefix} -c conda-forge catboost
    !pip install catboost
    
    import catboost
    from catboost import CatBoostClassifier
In [16]:
# https://anaconda.org/conda-forge/lightgbm
# conda install -c conda-forge lightgbm

lightgbm_check = importlib.util.find_spec('lightgbm')
found = lightgbm_check is not None
if found:
    import lightgbm as lgbm
else:
    !conda install --yes --prefix {sys.prefix} -c conda-forge lightgbm
    
    import lightgbm as lgbm
In [17]:
# https://anaconda.org/conda-forge/scikit-optimize
# conda install -c conda-forge scikit-optimize

skopt_check = importlib.util.find_spec('skopt')
found = skopt_check is not None
if found:
    import skopt
    from skopt import BayesSearchCV
else:
    #!conda install --yes --prefix {sys.prefix} -c conda-forge scikit-optimize
    !pip install scikit-optimize
    
    import skopt
    from skopt import BayesSearchCV
In [18]:
# https://anaconda.org/conda-forge/hyperopt
# conda install -c conda-forge hyperopt

hyperopt_check = importlib.util.find_spec('hyperopt')
found = hyperopt_check is not None
if found:
    import hyperopt
    from hyperopt import fmin, hp, tpe, rand, Trials, space_eval, STATUS_OK, STATUS_FAIL
else:
    !conda install --yes --prefix {sys.prefix} -c conda-forge hyperopt
    
    import hyperopt
    from hyperopt import fmin, hp, tpe, rand, Trials, space_eval, STATUS_OK, STATUS_FAIL
In [19]:
import tensorflow as tf
tf.set_random_seed(321)

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization, GaussianNoise
from keras.callbacks import EarlyStopping
from keras import regularizers
Using TensorFlow backend.
In [20]:
print('Operating system version........', platform.platform())
print('Python version is............... %s.%s.%s' % sys.version_info[:3])
print('scikit-learn version is.........', sklearn.__version__)
print('pandas version is...............', pd.__version__)
print('numpy version is................', np.__version__)
print('matplotlib version is...........', mpl.__version__)
print('seaborn version is..............', sns.__version__)
print('plotly version is...............', plotly.__version__)
print('scikit-plot version is..........', skplt.__version__)
print('missingno version is............', msno.__version__)
print('xgboost version is..............', xgb.__version__)
print('catboost version is.............', catboost.__version__)
print('lightgbm version is.............', lgbm.__version__)
print('scikit-optimize version is......', skopt.__version__)
print('hyperopt version is.............', hyperopt.__version__)
print('tensorflow version is...........', tf.__version__)
print('keras version is................', keras.__version__)
Operating system version........ Windows-10-10.0.16299-SP0
Python version is............... 3.6.8
scikit-learn version is......... 0.20.3
pandas version is............... 0.24.2
numpy version is................ 1.16.2
matplotlib version is........... 3.0.2
seaborn version is.............. 0.9.0
plotly version is............... 3.7.0
scikit-plot version is.......... 0.3.7
missingno version is............ 0.4.1
xgboost version is.............. 0.80
catboost version is............. 0.13
lightgbm version is............. 2.2.1
scikit-optimize version is...... 0.5.2
hyperopt version is............. 0.2
tensorflow version is........... 1.13.1
keras version is................ 2.2.4

Define generic common methods

In [21]:
def getDatasetInformation(csv_filepath, is_corr_required=True):
    """
    Read CSV (comma-separated) file into DataFrame
    
    Returns,
    - DataFrame
    - DataFrame's shape
    - DataFrame's data types
    - DataFrame's describe
    - DataFrame's sorted unique value count
    - DataFrame's missing or NULL value count
    - DataFrame's correlation between numerical columns
    """

    dataset_tmp = pd.read_csv(csv_filepath, header=None, index_col=None)
    dataset_tmp.columns = columns_name

    dataset_tmp_shape = pd.DataFrame(list(dataset_tmp.shape),
            index=['No of Rows', 'No of Columns'], columns=['Total'])
    dataset_tmp_shape = dataset_tmp_shape.reset_index()

    dataset_tmp_dtypes = dataset_tmp.dtypes.reset_index()
    dataset_tmp_dtypes.columns = ['Column Names', 'Column Data Types']

    dataset_tmp_desc = pd.DataFrame(dataset_tmp.describe())
    dataset_tmp_desc = dataset_tmp_desc.transpose()

    dataset_tmp_unique = dataset_tmp.nunique().reset_index()
    dataset_tmp_unique.columns = ['Column Name', 'Unique Value(s) Count'
                                  ]

    dataset_tmp_missing = dataset_tmp.isnull().sum(axis=0).reset_index()
    dataset_tmp_missing.columns = ['Column Names',
                                   'NULL value count per Column']
    dataset_tmp_missing = \
        dataset_tmp_missing.sort_values(by='NULL value count per Column'
            , ascending=False)

    if is_corr_required:
        dataset_tmp_corr = dataset_tmp.corr(method='spearman')
    else:
        dataset_tmp_corr = pd.DataFrame()

    return [
        dataset_tmp,
        dataset_tmp_shape,
        dataset_tmp_dtypes,
        dataset_tmp_desc,
        dataset_tmp_unique,
        dataset_tmp_missing,
        dataset_tmp_corr,
        ]
In [22]:
def getHighlyCorrelatedColumns(dataset, NoOfCols=6):
    df_corr = dataset.corr()

    # set the correlations on the diagonal or lower triangle to zero,
    # so they will not be reported as the highest ones

    df_corr *= np.tri(k=-1, *df_corr.values.shape).T
    df_corr = df_corr.stack()
    df_corr = \
        df_corr.reindex(df_corr.abs().sort_values(ascending=False).index).reset_index()
    return df_corr.head(NoOfCols)
In [23]:
def createFeatureEngineeredColumns(dataset):
    dataset_tmp = pd.DataFrame()

    dataset_tmp['CountOfZeroValues'] = (dataset == 0).sum(axis=1)
    dataset_tmp['CountOfNonZeroValues'] = (dataset != 0).sum(axis=1)

    weight = ((dataset != 0).sum() / len(dataset)).values
    dataset_tmp['WeightedCount'] = (dataset * weight).sum(axis=1)

    dataset_tmp['SumOfValues'] = dataset.sum(axis=1)

    dataset_tmp['VarianceOfValues'] = dataset.var(axis=1)
    dataset_tmp['MedianOfValues'] = dataset.median(axis=1)
    dataset_tmp['MeanOfValues'] = dataset.mean(axis=1)
    dataset_tmp['StandardDeviationOfValues'] = dataset.std(axis=1)
    dataset_tmp['ModeOfValues'] = dataset.mode(axis=1)
    dataset_tmp['SkewOfValues'] = dataset.skew(axis=1)
    dataset_tmp['KurtosisOfValues'] = dataset.kurtosis(axis=1)

    dataset_tmp['MaxOfValues'] = dataset.max(axis=1)
    dataset_tmp['MinOfValues'] = dataset.min(axis=1)
    dataset_tmp['DiffOfMinMaxOfValues'] = \
        np.subtract(dataset_tmp['MaxOfValues'],
                    dataset_tmp['MinOfValues'])

    dataset_tmp['QuantilePointFiveOfValues'] = dataset[dataset
            > 0].quantile(0.5, axis=1)

    dataset = pd.concat([dataset, dataset_tmp], axis=1)

    return dataset


def getZeroStdColumns(dataset):
    columnsWithZeroStd = dataset.columns[dataset.std() == 0].tolist()
    return columnsWithZeroStd


def getUniqueValueColumns(dataset, valueToCheck=0):
    columnsWithUniqueValue = dataset.columns[dataset.nunique()
            == valueToCheck].tolist()
    return columnsWithUniqueValue
In [24]:
def getScaledDataset(dataset, scaleType='StandardScaler'):
    if scaleType == 'StandardScaler':
        scaler = StandardScaler()
        scaled_tmp = scaler.fit_transform(dataset)
    elif scaleType == 'MinMaxScaler':
        scaler = MinMaxScaler()
        scaled_tmp = scaler.fit_transform(dataset)
    elif scaleType == 'RobustScaler':
        scaler = RobustScaler()
        scaled_tmp = scaler.fit_transform(dataset)
    elif scaleType == 'MaxAbsScaler':
        scaler = MaxAbsScaler()
        scaled_tmp = scaler.fit_transform(dataset)
    elif scaleType == 'Normalizer':
        scaler = Normalizer()
        scaled_tmp = scaler.fit_transform(dataset)

    if scaler is None:
        return [dataset, _]
    else:
        dataset = pd.DataFrame(scaled_tmp)
        return [dataset, scaler]
In [25]:
def plot_countplot(x, title='', xtitle=''):
    ncount = len(x)
    ax = sns.countplot(x=x)
    plt.title(title, fontsize=18)
    plt.xlabel(xtitle, fontsize=14)

    legend_labels = x.unique()
    plt.legend(legend_labels, ncol=1, loc='best')

    # Make twin axis
    ax2 = ax.twinx()

    # Switch so count axis is on right, frequency on left
    ax2.yaxis.tick_left()
    ax.yaxis.tick_right()

    # Also switch the labels over
    ax2.yaxis.set_label_position('left')
    ax.yaxis.set_label_position('right')

    ax2.set_ylabel('Frequency [%]', fontsize=14)
    ax.set_ylabel('Count', fontsize=14)

    for p in ax.patches:
        x = p.get_bbox().get_points()[:, 0]
        y = p.get_bbox().get_points()[1, 1]
        ax.annotate('{:.1f}%'.format(100. * y / ncount), (x.mean(), y),
                    ha='center', va='bottom')  # set the alignment of the text

    # Use a LinearLocator to ensure the correct number of ticks
    ax.yaxis.set_major_locator(ticker.LinearLocator(11))

    # Fix the frequency range to 0-100
    ax2.set_ylim(0, 100)
    ax.set_ylim(0, ncount)

    # And use a MultipleLocator to ensure a tick spacing of 10
    ax2.yaxis.set_major_locator(ticker.MultipleLocator(10))

    # Need to turn the grid on ax2 off, otherwise the gridlines end up on top of the bars
    ax2.grid(None)


def plot_valuecount_pieplot(x, title=''):
    x_value_count = x.value_counts()
    x_value_index = x_value_count.index
    pieplot = plt.pie(x_value_count, labels=x_value_index,
                      autopct='%1.1f%%', shadow=True, startangle=195)
    pieplot = plt.title(title, fontsize=18)
    pieplot = plt.axis('equal')
    plt.show()


def plot_boxplot(x, y, title=''):
    boxplot = sns.boxplot(x=x, y=y, palette=belize_light_flavor);
    boxplot = plt.title(title, fontsize=18)
    plt.xticks(rotation=90)
    plt.show()


def plot_distplot(dataset):
    import matplotlib.colors as mcolors

    colors = mcolors.TABLEAU_COLORS

    dataset_fordist = dataset.select_dtypes([np.int, np.float])
    number_of_subplots = len(dataset_fordist.columns)
    number_of_columns = 3

    number_of_rows = number_of_subplots // number_of_columns
    number_of_rows += number_of_subplots % number_of_columns

    postion = range(1, number_of_subplots + 1)

    fig = plt.figure(1)
    for k in range(number_of_subplots):
        ax = fig.add_subplot(number_of_rows, number_of_columns,
                             postion[k])
        sns.distplot(dataset_fordist.iloc[:, k],
                     color=random.choice(list(colors.keys())), ax=ax)
    fig.tight_layout()
    plt.show()
In [26]:
def getCategoricalVariableDistributionGraph(target_value, title=''):
    tmp_count = target_value.value_counts()
    figureCVDG = tools.make_subplots(rows=1, cols=2, shared_yaxes=True,
            subplot_titles=('Distribution Graph',
            'Distribution Graph - Bar'))
    figureCVDG.append_trace(go.Scatter(x=tmp_count.index, y=tmp_count,
                            mode='markers+lines', connectgaps=True), 1,
                            1)
    figureCVDG.append_trace(go.Bar(x=tmp_count.index, y=tmp_count), 1,
                            2)
    figureCVDG['layout'].update(title=title,
                                titlefont=dict(family='Arial',
                                size=36), paper_bgcolor='#ffffcf',
                                plot_bgcolor='#ffffcf')
    py.iplot(figureCVDG)


def getPlotlyLayout(title='', xtitle='', ytitle=''):
    layout = go.Layout(
        title=title,
        showlegend=True,
        hovermode='closest',
        paper_bgcolor='#ffffcf',
        plot_bgcolor='#ffffcf',
        titlefont=dict(family='Arial', size=36),
        xaxis=dict(title=xtitle, titlefont=dict(family='Arial',
                   size=18), tickfont=dict(family='Arial', size=14)),
        yaxis=dict(title=ytitle, titlefont=dict(family='Arial',
                   size=18), tickfont=dict(family='Arial', size=14)),
        )

    return layout
In [27]:
class PseudoLabeler(BaseEstimator, ClassifierMixin):
    '''
    Sci-kit learn wrapper for creating pseudo-lebeled estimators.
    '''
    
    def __init__(self, model, unlabled_data, features, target, sample_rate=0.2, seed=42):
        '''
        @sample_rate - percent of samples used as pseudo-labelled data
                       from the unlabled dataset
        '''
        assert sample_rate <= 1.0, 'Sample_rate should be between 0.0 and 1.0.'
        
        self.sample_rate = sample_rate
        self.seed = seed
        self.model = model
        self.model.seed = seed
        
        self.unlabled_data = unlabled_data
        self.features = features
        self.target = target
        
    def get_params(self, deep=True):
        return {
            "sample_rate": self.sample_rate,
            "seed": self.seed,
            "model": self.model,
            "unlabled_data": self.unlabled_data,
            "features": self.features,
            "target": self.target
        }

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

        
    def fit(self, X, y):
        '''
        Fit the data using pseudo labeling.
        '''

        augemented_train = self.__create_augmented_train(X, y)
        self.model.fit(
            augemented_train[self.features],
            augemented_train[self.target]
        )
        
        return self


    def __create_augmented_train(self, X, y):
        '''
        Create and return the augmented_train set that consists
        of pseudo-labeled and labeled data.
        '''        
        num_of_samples = int(len(self.unlabled_data) * self.sample_rate)
        
        # Train the model and creat the pseudo-labels
        self.model.fit(X, y)
        pseudo_labels = self.model.predict(self.unlabled_data[self.features])
        
        # Add the pseudo-labels to the test set
        pseudo_data = self.unlabled_data.copy(deep=True)
        pseudo_data[self.target] = pseudo_labels
        
        # Take a subset of the test set with pseudo-labels and append in onto
        # the training set
        sampled_pseudo_data = pseudo_data.sample(n=num_of_samples)
        temp_train = pd.concat([X, y], axis=1)
        augemented_train = pd.concat([sampled_pseudo_data, temp_train])

        return shuffle(augemented_train)
        
    def predict(self, X):
        '''
        Returns the predicted values.
        '''
        return self.model.predict(X)
    
    def predict_proba(self, X):
        '''
        Returns the proba.
        '''
        return self.model.predict_proba(X)
    
    def get_model_name(self):
        return self.model.__class__.__name__
In [28]:
def convertIntFloatToInt(dictObj):
    for (k, v) in dictObj.items():
        if float('Inf') == v:
            pass
        elif int(v) == v and isinstance(v, float):
            dictObj[k] = int(v)
    return dictObj
In [29]:
def attackTypeNumConverter(attack_type):
    if attack_type == 'normal':
        return 0
    else:
        return 1


def attackTypeConverter(attack_type):
    if attack_type == 'normal':
        return 'Normal'
    else:
        return 'Attack'
In [30]:
dos_list = [
    'back',
    'land',
    'neptune',
    'pod',
    'smurf',
    'teardrop',
    'apache2',
    'udpstorm',
    'processtable',
    'worm',
    ]
probe_list = [
    'satan',
    'ipsweep',
    'nmap',
    'portsweep',
    'mscan',
    'saint',
    ]
r2l_list = [
    'guess_password',
    'ftp_write',
    'imap',
    'phf',
    'multihop',
    'warezmaster',
    'warezclient',
    'spy',
    'xlock',
    'xsnoop',
    'snmpguess',
    'snmpgetattack',
    'httptunnel',
    'sendmail',
    'named',
    ]
u2r_list = [
    'buffer_overflow',
    'loadmodule',
    'rootkit',
    'perl',
    'sqlattack',
    'xterm',
    'ps',
    ]


def attackTypeMultiNumConverter(attack_type_value):
    if attack_type_value in dos_list:
        return 1
    elif attack_type_value in probe_list:
        return 2
    elif attack_type_value in r2l_list:
        return 3
    elif attack_type_value in u2r_list:
        return 4
    else:
        return 0


def attackTypeMultiConverter(attack_type_value):
    if attack_type_value in dos_list:
        return 'DoS'
    elif attack_type_value in probe_list:
        return 'Probe'
    elif attack_type_value in r2l_list:
        return 'R2L'
    elif attack_type_value in u2r_list:
        return 'U2R'
    else:
        return 'Normal'

Load data and perform analysis

In [31]:
root_dir = ''

try:
    from google.colab import drive
    drive.mount('/content/gdrive', force_remount=True)
    root_dir = '/content/gdrive/My Drive/Colab Notebooks/Telecom Network Anomaly Detection/'
    
    !ls '/content/gdrive/My Drive/Colab Notebooks/Telecom Network Anomaly Detection'
except:
    print('No GOOGLE DRIVE connection. Using local dataset(s).')
No GOOGLE DRIVE connection. Using local dataset(s).
In [32]:
columns_name = [
    'duration',
    'protocol_type',
    'service',
    'flag',
    'src_bytes',
    'dst_bytes',
    'land',
    'wrong_fragment',
    'urgent',
    'hot',
    'num_failed_logins',
    'logged_in',
    'num_compromised',
    'root_shell',
    'su_attempted',
    'num_root',
    'num_file_creations',
    'num_shells',
    'num_access_files',
    'num_outbound_cmds',
    'is_host_login',
    'is_guest_login',
    'count',
    'srv_count',
    'serror_rate',
    'srv_serror_rate',
    'rerror_rate',
    'srv_rerror_rate',
    'same_srv_rate',
    'diff_srv_rate',
    'srv_diff_host_rate',
    'dst_host_count',
    'dst_host_srv_count',
    'dst_host_same_srv_count',
    'dst_host_diff_srv_count',
    'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate',
    'dst_host_srv_serror_rate',
    'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate',
    'attack_type',
    'last_flag',
    ]
In [33]:
(
    dataset_network_train,
    df_train_shape,
    df_train_dtypes,
    df_train_describe,
    df_train_unique,
    df_train_missing,
    df_train_corr,
    ) = getDatasetInformation(root_dir + 'Train.txt', False)

(
    dataset_network_test,
    df_test_shape,
    df_test_dtypes,
    df_test_describe,
    df_test_unique,
    df_test_missing,
    df_test_corr,
    ) = getDatasetInformation(root_dir + 'Test.txt', False)
In [34]:
df_train_shape
Out[34]:
index Total
0 No of Rows 125973
1 No of Columns 43
In [35]:
df_test_shape
Out[35]:
index Total
0 No of Rows 22544
1 No of Columns 43

Key Observations

  • Train.txt Dataset comprises of 125973 observations and 43 characteristics.
  • Test.txt Dataset comprises of 22544 observations and 43 characteristics.
  • Out of which two are dependent variable (attack-type, last_flag) and rest 41 are independent variables.
In [36]:
dataset_network_train_rows = dataset_network_train.shape[0]
dataset_network_test_rows = dataset_network_test.shape[0]
In [37]:
dataset_network_train.head()
Out[37]:
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_count dst_host_diff_srv_count dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate attack_type last_flag
0 0 tcp ftp_data SF 491 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0.0 0.0 0.0 0.0 1.00 0.00 0.00 150 25 0.17 0.03 0.17 0.00 0.00 0.00 0.05 0.00 normal 20
1 0 udp other SF 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 1 0.0 0.0 0.0 0.0 0.08 0.15 0.00 255 1 0.00 0.60 0.88 0.00 0.00 0.00 0.00 0.00 normal 15
2 0 tcp private S0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 123 6 1.0 1.0 0.0 0.0 0.05 0.07 0.00 255 26 0.10 0.05 0.00 0.00 1.00 1.00 0.00 0.00 neptune 19
3 0 tcp http SF 232 8153 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 5 5 0.2 0.2 0.0 0.0 1.00 0.00 0.00 30 255 1.00 0.00 0.03 0.04 0.03 0.01 0.00 0.01 normal 21
4 0 tcp http SF 199 420 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 30 32 0.0 0.0 0.0 0.0 1.00 0.00 0.09 255 255 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 normal 21
In [38]:
dataset_network_test.head()
Out[38]:
duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_count dst_host_diff_srv_count dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate attack_type last_flag
0 0 tcp private REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 229 10 0.0 0.00 1.0 1.0 0.04 0.06 0.00 255 10 0.04 0.06 0.00 0.00 0.0 0.0 1.00 1.00 neptune 21
1 0 tcp private REJ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 136 1 0.0 0.00 1.0 1.0 0.01 0.06 0.00 255 1 0.00 0.06 0.00 0.00 0.0 0.0 1.00 1.00 neptune 21
2 2 tcp ftp_data SF 12983 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0.0 0.00 0.0 0.0 1.00 0.00 0.00 134 86 0.61 0.04 0.61 0.02 0.0 0.0 0.00 0.00 normal 21
3 0 icmp eco_i SF 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 65 0.0 0.00 0.0 0.0 1.00 0.00 1.00 3 57 1.00 0.00 1.00 0.28 0.0 0.0 0.00 0.00 saint 15
4 1 tcp telnet RSTO 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 8 0.0 0.12 1.0 0.5 1.00 0.00 0.75 29 86 0.31 0.17 0.03 0.02 0.0 0.0 0.83 0.71 mscan 11
In [39]:
df_train_dtypes
Out[39]:
Column Names Column Data Types
0 duration int64
1 protocol_type object
2 service object
3 flag object
4 src_bytes int64
5 dst_bytes int64
6 land int64
7 wrong_fragment int64
8 urgent int64
9 hot int64
10 num_failed_logins int64
11 logged_in int64
12 num_compromised int64
13 root_shell int64
14 su_attempted int64
15 num_root int64
16 num_file_creations int64
17 num_shells int64
18 num_access_files int64
19 num_outbound_cmds int64
20 is_host_login int64
21 is_guest_login int64
22 count int64
23 srv_count int64
24 serror_rate float64
25 srv_serror_rate float64
26 rerror_rate float64
27 srv_rerror_rate float64
28 same_srv_rate float64
29 diff_srv_rate float64
30 srv_diff_host_rate float64
31 dst_host_count int64
32 dst_host_srv_count int64
33 dst_host_same_srv_count float64
34 dst_host_diff_srv_count float64
35 dst_host_same_src_port_rate float64
36 dst_host_srv_diff_host_rate float64
37 dst_host_serror_rate float64
38 dst_host_srv_serror_rate float64
39 dst_host_rerror_rate float64
40 dst_host_srv_rerror_rate float64
41 attack_type object
42 last_flag int64
In [40]:
df_test_dtypes
Out[40]:
Column Names Column Data Types
0 duration int64
1 protocol_type object
2 service object
3 flag object
4 src_bytes int64
5 dst_bytes int64
6 land int64
7 wrong_fragment int64
8 urgent int64
9 hot int64
10 num_failed_logins int64
11 logged_in int64
12 num_compromised int64
13 root_shell int64
14 su_attempted int64
15 num_root int64
16 num_file_creations int64
17 num_shells int64
18 num_access_files int64
19 num_outbound_cmds int64
20 is_host_login int64
21 is_guest_login int64
22 count int64
23 srv_count int64
24 serror_rate float64
25 srv_serror_rate float64
26 rerror_rate float64
27 srv_rerror_rate float64
28 same_srv_rate float64
29 diff_srv_rate float64
30 srv_diff_host_rate float64
31 dst_host_count int64
32 dst_host_srv_count int64
33 dst_host_same_srv_count float64
34 dst_host_diff_srv_count float64
35 dst_host_same_src_port_rate float64
36 dst_host_srv_diff_host_rate float64
37 dst_host_serror_rate float64
38 dst_host_srv_serror_rate float64
39 dst_host_rerror_rate float64
40 dst_host_srv_rerror_rate float64
41 attack_type object
42 last_flag int64
In [41]:
df_train_describe
Out[41]:
count mean std min 25% 50% 75% max
duration 125973.0 287.144650 2.604515e+03 0.0 0.00 0.00 0.00 4.290800e+04
src_bytes 125973.0 45566.743000 5.870331e+06 0.0 0.00 44.00 276.00 1.379964e+09
dst_bytes 125973.0 19779.114421 4.021269e+06 0.0 0.00 0.00 516.00 1.309937e+09
land 125973.0 0.000198 1.408607e-02 0.0 0.00 0.00 0.00 1.000000e+00
wrong_fragment 125973.0 0.022687 2.535300e-01 0.0 0.00 0.00 0.00 3.000000e+00
urgent 125973.0 0.000111 1.436603e-02 0.0 0.00 0.00 0.00 3.000000e+00
hot 125973.0 0.204409 2.149968e+00 0.0 0.00 0.00 0.00 7.700000e+01
num_failed_logins 125973.0 0.001222 4.523914e-02 0.0 0.00 0.00 0.00 5.000000e+00
logged_in 125973.0 0.395736 4.890101e-01 0.0 0.00 0.00 1.00 1.000000e+00
num_compromised 125973.0 0.279250 2.394204e+01 0.0 0.00 0.00 0.00 7.479000e+03
root_shell 125973.0 0.001342 3.660284e-02 0.0 0.00 0.00 0.00 1.000000e+00
su_attempted 125973.0 0.001103 4.515438e-02 0.0 0.00 0.00 0.00 2.000000e+00
num_root 125973.0 0.302192 2.439962e+01 0.0 0.00 0.00 0.00 7.468000e+03
num_file_creations 125973.0 0.012669 4.839351e-01 0.0 0.00 0.00 0.00 4.300000e+01
num_shells 125973.0 0.000413 2.218113e-02 0.0 0.00 0.00 0.00 2.000000e+00
num_access_files 125973.0 0.004096 9.936956e-02 0.0 0.00 0.00 0.00 9.000000e+00
num_outbound_cmds 125973.0 0.000000 0.000000e+00 0.0 0.00 0.00 0.00 0.000000e+00
is_host_login 125973.0 0.000008 2.817483e-03 0.0 0.00 0.00 0.00 1.000000e+00
is_guest_login 125973.0 0.009423 9.661233e-02 0.0 0.00 0.00 0.00 1.000000e+00
count 125973.0 84.107555 1.145086e+02 0.0 2.00 14.00 143.00 5.110000e+02
srv_count 125973.0 27.737888 7.263584e+01 0.0 2.00 8.00 18.00 5.110000e+02
serror_rate 125973.0 0.284485 4.464556e-01 0.0 0.00 0.00 1.00 1.000000e+00
srv_serror_rate 125973.0 0.282485 4.470225e-01 0.0 0.00 0.00 1.00 1.000000e+00
rerror_rate 125973.0 0.119958 3.204355e-01 0.0 0.00 0.00 0.00 1.000000e+00
srv_rerror_rate 125973.0 0.121183 3.236472e-01 0.0 0.00 0.00 0.00 1.000000e+00
same_srv_rate 125973.0 0.660928 4.396229e-01 0.0 0.09 1.00 1.00 1.000000e+00
diff_srv_rate 125973.0 0.063053 1.803144e-01 0.0 0.00 0.00 0.06 1.000000e+00
srv_diff_host_rate 125973.0 0.097322 2.598305e-01 0.0 0.00 0.00 0.00 1.000000e+00
dst_host_count 125973.0 182.148945 9.920621e+01 0.0 82.00 255.00 255.00 2.550000e+02
dst_host_srv_count 125973.0 115.653005 1.107027e+02 0.0 10.00 63.00 255.00 2.550000e+02
dst_host_same_srv_count 125973.0 0.521242 4.489494e-01 0.0 0.05 0.51 1.00 1.000000e+00
dst_host_diff_srv_count 125973.0 0.082951 1.889218e-01 0.0 0.00 0.02 0.07 1.000000e+00
dst_host_same_src_port_rate 125973.0 0.148379 3.089971e-01 0.0 0.00 0.00 0.06 1.000000e+00
dst_host_srv_diff_host_rate 125973.0 0.032542 1.125638e-01 0.0 0.00 0.00 0.02 1.000000e+00
dst_host_serror_rate 125973.0 0.284452 4.447841e-01 0.0 0.00 0.00 1.00 1.000000e+00
dst_host_srv_serror_rate 125973.0 0.278485 4.456691e-01 0.0 0.00 0.00 1.00 1.000000e+00
dst_host_rerror_rate 125973.0 0.118832 3.065575e-01 0.0 0.00 0.00 0.00 1.000000e+00
dst_host_srv_rerror_rate 125973.0 0.120240 3.194594e-01 0.0 0.00 0.00 0.00 1.000000e+00
last_flag 125973.0 19.504060 2.291503e+00 0.0 18.00 20.00 21.00 2.100000e+01
In [42]:
df_test_describe
Out[42]:
count mean std min 25% 50% 75% max
duration 22544.0 218.859076 1407.176612 0.0 0.00 0.00 0.0000 57715.0
src_bytes 22544.0 10395.450231 472786.431088 0.0 0.00 54.00 287.0000 62825648.0
dst_bytes 22544.0 2056.018808 21219.297609 0.0 0.00 46.00 601.0000 1345927.0
land 22544.0 0.000311 0.017619 0.0 0.00 0.00 0.0000 1.0
wrong_fragment 22544.0 0.008428 0.142599 0.0 0.00 0.00 0.0000 3.0
urgent 22544.0 0.000710 0.036473 0.0 0.00 0.00 0.0000 3.0
hot 22544.0 0.105394 0.928428 0.0 0.00 0.00 0.0000 101.0
num_failed_logins 22544.0 0.021647 0.150328 0.0 0.00 0.00 0.0000 4.0
logged_in 22544.0 0.442202 0.496659 0.0 0.00 0.00 1.0000 1.0
num_compromised 22544.0 0.119899 7.269597 0.0 0.00 0.00 0.0000 796.0
root_shell 22544.0 0.002440 0.049334 0.0 0.00 0.00 0.0000 1.0
su_attempted 22544.0 0.000266 0.021060 0.0 0.00 0.00 0.0000 2.0
num_root 22544.0 0.114665 8.041614 0.0 0.00 0.00 0.0000 878.0
num_file_creations 22544.0 0.008738 0.676842 0.0 0.00 0.00 0.0000 100.0
num_shells 22544.0 0.001153 0.048014 0.0 0.00 0.00 0.0000 5.0
num_access_files 22544.0 0.003549 0.067829 0.0 0.00 0.00 0.0000 4.0
num_outbound_cmds 22544.0 0.000000 0.000000 0.0 0.00 0.00 0.0000 0.0
is_host_login 22544.0 0.000488 0.022084 0.0 0.00 0.00 0.0000 1.0
is_guest_login 22544.0 0.028433 0.166211 0.0 0.00 0.00 0.0000 1.0
count 22544.0 79.028345 128.539248 0.0 1.00 8.00 123.2500 511.0
srv_count 22544.0 31.124379 89.062532 0.0 1.00 6.00 16.0000 511.0
serror_rate 22544.0 0.102924 0.295367 0.0 0.00 0.00 0.0000 1.0
srv_serror_rate 22544.0 0.103635 0.298332 0.0 0.00 0.00 0.0000 1.0
rerror_rate 22544.0 0.238463 0.416118 0.0 0.00 0.00 0.2500 1.0
srv_rerror_rate 22544.0 0.235179 0.416215 0.0 0.00 0.00 0.0725 1.0
same_srv_rate 22544.0 0.740345 0.412496 0.0 0.25 1.00 1.0000 1.0
diff_srv_rate 22544.0 0.094074 0.259138 0.0 0.00 0.00 0.0600 1.0
srv_diff_host_rate 22544.0 0.098110 0.253545 0.0 0.00 0.00 0.0000 1.0
dst_host_count 22544.0 193.869411 94.035663 0.0 121.00 255.00 255.0000 255.0
dst_host_srv_count 22544.0 140.750532 111.783972 0.0 15.00 168.00 255.0000 255.0
dst_host_same_srv_count 22544.0 0.608722 0.435688 0.0 0.07 0.92 1.0000 1.0
dst_host_diff_srv_count 22544.0 0.090540 0.220717 0.0 0.00 0.01 0.0600 1.0
dst_host_same_src_port_rate 22544.0 0.132261 0.306268 0.0 0.00 0.00 0.0300 1.0
dst_host_srv_diff_host_rate 22544.0 0.019638 0.085394 0.0 0.00 0.00 0.0100 1.0
dst_host_serror_rate 22544.0 0.097814 0.273139 0.0 0.00 0.00 0.0000 1.0
dst_host_srv_serror_rate 22544.0 0.099426 0.281866 0.0 0.00 0.00 0.0000 1.0
dst_host_rerror_rate 22544.0 0.233385 0.387229 0.0 0.00 0.00 0.3600 1.0
dst_host_srv_rerror_rate 22544.0 0.226683 0.400875 0.0 0.00 0.00 0.1700 1.0
last_flag 22544.0 18.017965 4.270361 0.0 17.00 20.00 21.0000 21.0

Key Observations

  • Here as you can notice mean value is very higher than median value of each column which is represented by 50% (50th percentile) in index column.
  • Thus observations of describe() method suggests that there are extreme values-Outliers in our data set.
In [43]:
df_train_unique
Out[43]:
Column Name Unique Value(s) Count
0 duration 2981
1 protocol_type 3
2 service 70
3 flag 11
4 src_bytes 3341
5 dst_bytes 9326
6 land 2
7 wrong_fragment 3
8 urgent 4
9 hot 28
10 num_failed_logins 6
11 logged_in 2
12 num_compromised 88
13 root_shell 2
14 su_attempted 3
15 num_root 82
16 num_file_creations 35
17 num_shells 3
18 num_access_files 10
19 num_outbound_cmds 1
20 is_host_login 2
21 is_guest_login 2
22 count 512
23 srv_count 509
24 serror_rate 89
25 srv_serror_rate 86
26 rerror_rate 82
27 srv_rerror_rate 62
28 same_srv_rate 101
29 diff_srv_rate 95
30 srv_diff_host_rate 60
31 dst_host_count 256
32 dst_host_srv_count 256
33 dst_host_same_srv_count 101
34 dst_host_diff_srv_count 101
35 dst_host_same_src_port_rate 101
36 dst_host_srv_diff_host_rate 75
37 dst_host_serror_rate 101
38 dst_host_srv_serror_rate 100
39 dst_host_rerror_rate 101
40 dst_host_srv_rerror_rate 101
41 attack_type 23
42 last_flag 22
In [44]:
df_test_unique
Out[44]:
Column Name Unique Value(s) Count
0 duration 624
1 protocol_type 3
2 service 64
3 flag 11
4 src_bytes 1149
5 dst_bytes 3650
6 land 2
7 wrong_fragment 3
8 urgent 4
9 hot 16
10 num_failed_logins 5
11 logged_in 2
12 num_compromised 23
13 root_shell 2
14 su_attempted 3
15 num_root 20
16 num_file_creations 9
17 num_shells 4
18 num_access_files 5
19 num_outbound_cmds 1
20 is_host_login 2
21 is_guest_login 2
22 count 495
23 srv_count 457
24 serror_rate 88
25 srv_serror_rate 82
26 rerror_rate 90
27 srv_rerror_rate 93
28 same_srv_rate 75
29 diff_srv_rate 99
30 srv_diff_host_rate 84
31 dst_host_count 256
32 dst_host_srv_count 256
33 dst_host_same_srv_count 101
34 dst_host_diff_srv_count 101
35 dst_host_same_src_port_rate 101
36 dst_host_srv_diff_host_rate 58
37 dst_host_serror_rate 99
38 dst_host_srv_serror_rate 101
39 dst_host_rerror_rate 101
40 dst_host_srv_rerror_rate 100
41 attack_type 38
42 last_flag 22
In [45]:
df_train_missing
Out[45]:
Column Names NULL value count per Column
0 duration 0
32 dst_host_srv_count 0
24 serror_rate 0
25 srv_serror_rate 0
26 rerror_rate 0
27 srv_rerror_rate 0
28 same_srv_rate 0
29 diff_srv_rate 0
30 srv_diff_host_rate 0
31 dst_host_count 0
33 dst_host_same_srv_count 0
22 count 0
34 dst_host_diff_srv_count 0
35 dst_host_same_src_port_rate 0
36 dst_host_srv_diff_host_rate 0
37 dst_host_serror_rate 0
38 dst_host_srv_serror_rate 0
39 dst_host_rerror_rate 0
40 dst_host_srv_rerror_rate 0
41 attack_type 0
23 srv_count 0
21 is_guest_login 0
1 protocol_type 0
10 num_failed_logins 0
2 service 0
3 flag 0
4 src_bytes 0
5 dst_bytes 0
6 land 0
7 wrong_fragment 0
8 urgent 0
9 hot 0
11 logged_in 0
20 is_host_login 0
12 num_compromised 0
13 root_shell 0
14 su_attempted 0
15 num_root 0
16 num_file_creations 0
17 num_shells 0
18 num_access_files 0
19 num_outbound_cmds 0
42 last_flag 0
In [46]:
msno.matrix(dataset_network_train, color=(33 / 255, 102 / 255, 172 / 255));
In [47]:
df_test_missing
Out[47]:
Column Names NULL value count per Column
0 duration 0
32 dst_host_srv_count 0
24 serror_rate 0
25 srv_serror_rate 0
26 rerror_rate 0
27 srv_rerror_rate 0
28 same_srv_rate 0
29 diff_srv_rate 0
30 srv_diff_host_rate 0
31 dst_host_count 0
33 dst_host_same_srv_count 0
22 count 0
34 dst_host_diff_srv_count 0
35 dst_host_same_src_port_rate 0
36 dst_host_srv_diff_host_rate 0
37 dst_host_serror_rate 0
38 dst_host_srv_serror_rate 0
39 dst_host_rerror_rate 0
40 dst_host_srv_rerror_rate 0
41 attack_type 0
23 srv_count 0
21 is_guest_login 0
1 protocol_type 0
10 num_failed_logins 0
2 service 0
3 flag 0
4 src_bytes 0
5 dst_bytes 0
6 land 0
7 wrong_fragment 0
8 urgent 0
9 hot 0
11 logged_in 0
20 is_host_login 0
12 num_compromised 0
13 root_shell 0
14 su_attempted 0
15 num_root 0
16 num_file_creations 0
17 num_shells 0
18 num_access_files 0
19 num_outbound_cmds 0
42 last_flag 0
In [48]:
msno.matrix(dataset_network_test, color=(33 / 255, 102 / 255, 172 / 255));

Key Observations

  • Data has mostly float and integer values except three independent variables (protocol_type, service, and flag) which are object type (categorical)
  • No variable column has null/missing values.

Perform exploratory data analysis (EDA)

Understanding Distribution of Feature(s)

In [49]:
plot_distplot(dataset_network_train)
In [50]:
plot_distplot(dataset_network_test)

Understanding Target variable (attack_type)

In [51]:
dataset_network_train.attack_type.unique()
Out[51]:
array(['normal', 'neptune', 'warezclient', 'ipsweep', 'portsweep',
       'teardrop', 'nmap', 'satan', 'smurf', 'pod', 'back',
       'guess_passwd', 'ftp_write', 'multihop', 'rootkit',
       'buffer_overflow', 'imap', 'warezmaster', 'phf', 'land',
       'loadmodule', 'spy', 'perl'], dtype=object)
In [52]:
dataset_network_train.attack_type.value_counts()
Out[52]:
normal             67343
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: attack_type, dtype: int64

Key Observations

  • The dataset(s) has observations which are very less between 1 to 53 observations only.
  • The dataset(s) are highly unbalanced.
In [53]:
plot_valuecount_pieplot(dataset_network_train.attack_type, 'Attack Type (train feature) Distribution Percentage')
In [54]:
dataset_network_test.attack_type.unique()
Out[54]:
array(['neptune', 'normal', 'saint', 'mscan', 'guess_passwd', 'smurf',
       'apache2', 'satan', 'buffer_overflow', 'back', 'warezmaster',
       'snmpgetattack', 'processtable', 'pod', 'httptunnel', 'nmap', 'ps',
       'snmpguess', 'ipsweep', 'mailbomb', 'portsweep', 'multihop',
       'named', 'sendmail', 'loadmodule', 'xterm', 'worm', 'teardrop',
       'rootkit', 'xlock', 'perl', 'land', 'xsnoop', 'sqlattack',
       'ftp_write', 'imap', 'udpstorm', 'phf'], dtype=object)
In [55]:
dataset_network_test.attack_type.value_counts()
Out[55]:
normal             9711
neptune            4657
guess_passwd       1231
mscan               996
warezmaster         944
apache2             737
satan               735
processtable        685
smurf               665
back                359
snmpguess           331
saint               319
mailbomb            293
snmpgetattack       178
portsweep           157
ipsweep             141
httptunnel          133
nmap                 73
pod                  41
buffer_overflow      20
multihop             18
named                17
ps                   15
sendmail             14
xterm                13
rootkit              13
teardrop             12
xlock                 9
land                  7
xsnoop                4
ftp_write             3
perl                  2
loadmodule            2
phf                   2
worm                  2
sqlattack             2
udpstorm              2
imap                  1
Name: attack_type, dtype: int64

Key Observations

  • The dataset(s) has observations which are very less between 1 to 73 observations only.
  • The dataset(s) are highly unbalanced.
In [56]:
plot_valuecount_pieplot(dataset_network_test.attack_type, 'Attack Type (test feature) Distribution Percentage')

Key Observations

  • Train.txt data's target variable "attack_type" has less number of unique values then Test.txt. Test.txt has additional unique values which is not available for training.
  • Target variable/Dependent variable is discrete and categorical in nature.

Understanding Target variable (last_flag)

In [57]:
dataset_network_train.last_flag.unique()
Out[57]:
array([20, 15, 19, 21, 18, 17, 16, 12, 14, 11,  2, 13, 10,  9,  8,  7,  3,
        5,  1,  6,  0,  4], dtype=int64)
In [58]:
dataset_network_train.last_flag.value_counts()
Out[58]:
21    62557
18    20667
20    19339
19    10284
15     3990
17     3074
16     2393
12      729
14      674
11      641
13      451
10      253
9       194
7       118
8       106
6        96
5        81
4        79
0        66
3        65
1        62
2        54
Name: last_flag, dtype: int64
In [59]:
plot_valuecount_pieplot(dataset_network_train.last_flag, 'Last Flag (train feature) Distribution Percentage')
In [60]:
dataset_network_test.last_flag.unique()
Out[60]:
array([21, 15, 11,  7,  9, 18, 14, 20, 17,  1, 19, 12, 13,  3,  8,  0, 16,
       10,  2,  5,  6,  4], dtype=int64)
In [61]:
dataset_network_test.last_flag.value_counts()
Out[61]:
21    10694
18     2967
20     1343
15     1176
17     1168
19      890
14      736
16      681
13      519
12      486
11      461
7       249
10      195
6       157
8       131
0       123
3       116
9       106
5       103
4       101
1        87
2        55
Name: last_flag, dtype: int64
In [62]:
plot_valuecount_pieplot(dataset_network_test.last_flag, 'Last Flag (test feature) Distribution Percentage')

Understanding distribution of Nominal Features

In [63]:
plt.subplot(121)
plot_valuecount_pieplot(dataset_network_train.protocol_type, 'Protocol Type (train feature) Distribution Percentage')
plt.subplot(122)
plot_valuecount_pieplot(dataset_network_test.protocol_type, 'Protocol Type (test feature) Distribution Percentage')
plt.tight_layout()
plt.show()
<Figure size 1080x864 with 0 Axes>
In [64]:
plt.subplot(121)
plot_valuecount_pieplot(dataset_network_train.service, 'Service (train feature) Distribution Percentage')
plt.subplot(122)
plot_valuecount_pieplot(dataset_network_test.service, 'Service (test feature) Distribution Percentage')
plt.tight_layout()
plt.show()
<Figure size 1080x864 with 0 Axes>
In [65]:
plt.subplot(121)
plot_valuecount_pieplot(dataset_network_train.flag, 'Flag (train feature) Distribution Percentage')
plt.subplot(122)
plot_valuecount_pieplot(dataset_network_test.flag, 'Flag (test feature) Distribution Percentage')
plt.tight_layout()
plt.show()
<Figure size 1080x864 with 0 Axes>

Understanding distribution of Binary Features

In [66]:
plt.subplot(121)
plot_countplot(dataset_network_train.land, 'Land (train feature) Distribution', 'Land values')
plt.subplot(122)
plot_countplot(dataset_network_test.land, 'Land (test feature) Distribution', 'Land values')
plt.tight_layout()
plt.show()
In [67]:
plt.subplot(121)
plot_countplot(dataset_network_train.logged_in, 'Logged In (train feature) Distribution', 'Logged In values')
plt.subplot(122)
plot_countplot(dataset_network_test.logged_in, 'Logged In (test feature) Distribution', 'Logged In values')
plt.tight_layout()
plt.show()
In [68]:
plt.subplot(121)
plot_countplot(dataset_network_train.root_shell, 'Root Shell (train feature) Distribution', 'Root Shell values')
plt.subplot(122)
plot_countplot(dataset_network_test.root_shell, 'Root Shell (test feature) Distribution', 'Root Shell values')
plt.tight_layout()
plt.show()
In [69]:
plt.subplot(121)
plot_countplot(dataset_network_train.su_attempted, 'su Attempted (train feature) Distribution', 'su Attempted values')
plt.subplot(122)
plot_countplot(dataset_network_test.su_attempted, 'su Attempted (test feature) Distribution', 'su Attempted values')
plt.tight_layout()
plt.show()
In [70]:
plt.subplot(121)
plot_countplot(dataset_network_train.is_host_login, 'Is Host Login (train feature) Distribution', 'Is Host Login values')
plt.subplot(122)
plot_countplot(dataset_network_test.is_host_login, 'Is Host Login (test feature) Distribution', 'Is Host Login values')
plt.tight_layout()
plt.show()
In [71]:
plt.subplot(121)
plot_countplot(dataset_network_train.is_guest_login, 'Is Guest Login (train feature) Distribution', 'Is Guest Login values')
plt.subplot(122)
plot_countplot(dataset_network_test.is_guest_login, 'Is Guest Login (test feature) Distribution', 'Is Guest Login values')
plt.tight_layout()
plt.show()
In [72]:
plt.subplot(211)
plot_boxplot(dataset_network_train.attack_type, 
             dataset_network_train.dst_host_count, 
             title='attack_type v/s dst_host_count (train data)'
            );
plt.subplot(211)
plot_boxplot(dataset_network_test.attack_type, 
             dataset_network_test.dst_host_count, 
             title='attack_type v/s dst_host_count (test data)'
            );
plt.show()
In [73]:
plt.subplot(211)
plot_boxplot(dataset_network_train.attack_type, 
             dataset_network_train.dst_host_srv_count, 
             title='attack_type v/s dst_host_srv_count (train data)'
            );
plt.subplot(211)
plot_boxplot(dataset_network_test.attack_type, 
             dataset_network_test.dst_host_srv_count, 
             title='attack_type v/s dst_host_srv_count (test data)'
            );
plt.show()
In [74]:
plt.subplot(211)
plot_boxplot(dataset_network_train.attack_type, 
             dataset_network_train.is_guest_login, 
             title='attack_type v/s is_guest_login (train data)'
            );
plt.subplot(211)
plot_boxplot(dataset_network_test.attack_type, 
             dataset_network_test.is_guest_login, 
             title='attack_type v/s is_guest_login (test data)'
            );
plt.show()

Feature Engineering

Correlation Matrix

In [75]:
df_corr = dataset_network_train.corr(method='spearman')
df_corr
Out[75]:
duration src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_count dst_host_diff_srv_count dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate last_flag
duration 1.000000 0.226289 0.148983 -0.004136 -0.027429 0.029670 0.229319 0.058024 0.119651 0.080430 0.074821 0.075079 0.042033 0.086589 0.009893 0.048248 NaN 0.009817 0.305264 -0.323663 -0.319135 -0.184564 -0.181817 0.051243 0.045693 0.168681 -0.139716 0.015576 -0.066844 -0.156473 -0.140483 0.196320 0.182132 -0.026560 -0.155824 -0.153078 0.065207 0.068996 -0.007767
src_bytes 0.226289 1.000000 0.700442 -0.015303 0.015848 0.007789 0.204355 0.012740 0.776844 0.154821 0.040084 0.033547 0.080847 0.072759 0.029673 0.063387 NaN 0.000955 0.132652 -0.523823 -0.053056 -0.673558 -0.652557 -0.363120 -0.344039 0.752801 -0.705392 0.289111 -0.405293 0.620720 0.617294 -0.525394 0.377040 0.343929 -0.624111 -0.608681 -0.229499 -0.259287 0.279580
dst_bytes 0.148983 0.700442 1.000000 -0.012239 -0.080977 0.013028 0.200110 0.025260 0.822190 0.171165 0.062631 0.046253 -0.012985 0.040314 -0.006556 0.068272 NaN 0.004716 0.131275 -0.439640 -0.016658 -0.536478 -0.510009 -0.299625 -0.277687 0.629973 -0.608104 0.310325 -0.342276 0.707573 0.666924 -0.624866 0.059123 0.327014 -0.519159 -0.483431 -0.218704 -0.186047 0.460242
land -0.004136 -0.015303 -0.012239 1.000000 -0.001316 -0.000119 -0.002073 -0.000439 -0.011402 -0.001431 -0.000516 -0.000355 -0.001014 -0.000673 -0.000272 -0.000766 NaN -0.000040 -0.001374 -0.017707 -0.013622 0.021598 0.022250 -0.003908 -0.005409 0.008607 -0.008086 0.023460 -0.026018 -0.017995 0.012719 -0.011336 0.022817 0.021344 0.020113 0.017276 -0.003990 -0.005984 -0.017332
wrong_fragment -0.027429 0.015848 -0.080977 -0.001316 1.000000 -0.000790 -0.013749 -0.002909 -0.075605 -0.009488 -0.003424 -0.002355 -0.006723 -0.004464 -0.001805 -0.005077 NaN -0.000263 -0.009112 0.011829 0.085125 -0.040707 -0.060279 -0.018378 -0.035871 0.047834 -0.047221 -0.027775 0.022517 -0.018833 -0.020442 0.032054 0.109508 -0.036583 -0.016905 -0.063794 0.038362 -0.039682 -0.146584
urgent 0.029670 0.007789 0.013028 -0.000119 -0.000790 1.000000 0.017955 0.060172 0.008524 0.055610 0.102325 0.111647 0.078350 0.058679 -0.000163 0.016865 NaN -0.000024 -0.000824 -0.011476 -0.011743 -0.005634 -0.005454 -0.003241 -0.003245 0.006498 -0.006571 -0.004512 -0.006790 -0.011919 -0.004589 0.002349 0.004120 -0.000385 -0.004700 -0.005772 -0.003949 -0.003590 -0.007988
hot 0.229319 0.204355 0.200110 -0.002073 -0.013749 0.017955 1.000000 0.092974 0.171878 0.514724 0.140361 0.015718 0.008543 0.052763 0.014253 0.002123 NaN 0.018784 0.665821 -0.153166 -0.143470 -0.086205 -0.082632 -0.009033 0.012295 0.106979 -0.103591 -0.028497 -0.068595 0.004939 0.063152 -0.060204 -0.001804 -0.066183 -0.063820 -0.070269 0.095387 0.095205 -0.152387
num_failed_logins 0.058024 0.012740 0.025260 -0.000439 -0.002909 0.060172 0.092974 1.000000 -0.011102 0.035221 0.026757 0.090380 0.037102 0.062769 -0.000602 0.003015 NaN -0.000088 0.002249 -0.038077 -0.039550 -0.018389 -0.018135 0.030823 0.030205 0.022804 -0.022191 -0.016621 -0.031247 -0.018930 0.007478 -0.002810 0.015013 -0.018310 0.007159 0.007902 0.029613 0.029172 -0.048949
logged_in 0.119651 0.776844 0.822190 -0.011402 -0.075605 0.008524 0.171878 -0.011102 1.000000 0.125492 0.045290 0.031150 0.088923 0.058708 0.023873 0.067158 NaN 0.003482 0.119678 -0.503868 -0.137034 -0.492653 -0.472549 -0.283275 -0.263681 0.596788 -0.588050 0.265448 -0.424506 0.642934 0.606223 -0.555504 0.144428 0.426757 -0.464252 -0.430866 -0.196843 -0.175929 0.491281
num_compromised 0.080430 0.154821 0.171165 -0.001431 -0.009488 0.055610 0.514724 0.035221 0.125492 1.000000 0.215769 0.237853 0.165064 0.111913 0.022700 0.093062 NaN 0.027991 -0.009905 -0.087423 -0.078700 -0.062378 -0.060477 -0.002037 0.027408 0.075803 -0.075177 -0.013983 -0.046542 0.007197 0.068608 -0.067647 -0.002362 -0.052963 -0.030676 -0.027964 0.125343 0.147736 -0.152731
root_shell 0.074821 0.040084 0.062631 -0.000516 -0.003424 0.102325 0.140361 0.026757 0.045290 0.215769 1.000000 0.584521 0.246675 0.134749 0.089145 0.206447 NaN -0.000103 -0.003575 -0.039055 -0.035080 -0.018925 -0.018005 -0.008922 -0.009572 0.027347 -0.026764 -0.016112 -0.029030 -0.009064 0.011467 -0.010411 0.013502 0.010014 -0.010312 -0.010236 -0.000550 0.002795 -0.030324
su_attempted 0.075079 0.033547 0.046253 -0.000355 -0.002355 0.111647 0.015718 0.090380 0.031150 0.237853 0.584521 1.000000 0.342848 0.117757 0.015821 0.278244 NaN -0.000071 -0.002459 -0.033899 -0.034855 -0.008584 -0.007853 -0.005938 -0.006827 0.018973 -0.018748 -0.013457 -0.011683 -0.019863 -0.009882 0.014021 0.001039 -0.006066 0.003470 0.004323 0.006978 0.008884 -0.023846
num_root 0.042033 0.080847 -0.012985 -0.001014 -0.006723 0.078350 0.008543 0.037102 0.088923 0.165064 0.246675 0.342848 1.000000 0.110760 0.038843 0.101065 NaN 0.039307 -0.007019 -0.080211 -0.080384 -0.042540 -0.041042 -0.025009 -0.025659 0.049556 -0.044460 -0.026586 -0.053603 -0.030721 -0.021363 0.039453 0.060567 0.034999 -0.032004 -0.023339 -0.010874 -0.013006 -0.001660
num_file_creations 0.086589 0.072759 0.040314 -0.000673 -0.004464 0.058679 0.052763 0.062769 0.058708 0.111913 0.134749 0.117757 0.110760 1.000000 0.085358 0.062008 NaN -0.000135 0.007398 -0.052070 -0.051095 -0.021730 -0.024007 -0.014897 -0.015332 0.030867 -0.028754 -0.008095 -0.021418 -0.021221 -0.010295 0.019298 0.016066 0.001808 -0.001104 -0.006529 0.001445 0.001619 -0.016350
num_shells 0.009893 0.029673 -0.006556 -0.000272 -0.001805 -0.000163 0.014253 -0.000602 0.023873 0.022700 0.089145 0.015821 0.038843 0.085358 1.000000 0.021719 NaN -0.000054 -0.001884 -0.010317 -0.007403 -0.010356 -0.011826 -0.007407 -0.007418 0.010895 -0.008736 -0.007551 -0.010266 -0.008106 -0.004130 0.003848 0.021355 0.002260 -0.006687 -0.007329 -0.000450 -0.007182 -0.004215
num_access_files 0.048248 0.063387 0.068272 -0.000766 -0.005077 0.016865 0.002123 0.003015 0.067158 0.093062 0.206447 0.278244 0.101065 0.062008 0.021719 1.000000 NaN -0.000153 -0.002269 -0.053726 -0.044124 -0.032070 -0.030483 -0.019168 -0.008358 0.039703 -0.038935 0.008526 -0.000255 0.011896 0.010824 0.002270 -0.014611 -0.012264 -0.023331 -0.023089 -0.009838 -0.007203 0.021126
num_outbound_cmds NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
is_host_login 0.009817 0.000955 0.004716 -0.000040 -0.000263 -0.000024 0.018784 -0.000088 0.003482 0.027991 -0.000103 -0.000071 0.039307 -0.000135 -0.000054 -0.000153 NaN 1.000000 -0.000275 -0.003825 -0.003914 -0.001878 -0.001818 -0.001080 -0.001082 0.002166 -0.002190 -0.001504 0.002252 -0.001381 -0.001694 0.004532 -0.002612 -0.001847 0.002546 0.002615 -0.001316 0.005779 -0.004272
is_guest_login 0.305264 0.132652 0.131275 -0.001374 -0.009112 -0.000824 0.665821 0.002249 0.119678 -0.009905 -0.003575 -0.002459 -0.007019 0.007398 -0.001884 -0.002269 NaN -0.000275 1.000000 -0.129724 -0.134285 -0.063175 -0.061101 -0.035989 -0.036208 0.070011 -0.065696 -0.048677 -0.034774 -0.034702 -0.020360 0.025117 -0.017738 -0.060950 -0.046317 -0.061750 0.009867 -0.010718 -0.059315
count -0.323663 -0.523823 -0.439640 -0.017707 0.011829 -0.011476 -0.153166 -0.038077 -0.503868 -0.087423 -0.039055 -0.033899 -0.080211 -0.052070 -0.010317 -0.053726 NaN -0.003825 -0.129724 1.000000 0.518602 0.577649 0.541777 0.071425 0.065207 -0.719825 0.616829 -0.325432 0.619602 -0.324673 -0.429355 0.361821 -0.546478 -0.531004 0.536367 0.505348 0.021829 0.023688 -0.152516
srv_count -0.319135 -0.053056 -0.016658 -0.013622 0.085125 -0.011743 -0.143470 -0.039550 -0.137034 -0.078700 -0.035080 -0.034855 -0.080384 -0.051095 -0.007403 -0.044124 NaN -0.003914 -0.134285 0.518602 1.000000 0.073237 0.108170 -0.207601 -0.207795 0.031631 -0.038901 0.235030 0.218114 0.299753 0.257425 -0.240947 -0.196667 -0.122382 0.045705 0.079506 -0.213103 -0.219695 -0.052618
serror_rate -0.184564 -0.673558 -0.536478 0.021598 -0.040707 -0.005634 -0.086205 -0.018389 -0.492653 -0.062378 -0.018925 -0.008584 -0.042540 -0.021730 -0.010356 -0.032070 NaN -0.001878 -0.063175 0.577649 0.073237 1.000000 0.973119 -0.173769 -0.179206 -0.755361 0.674216 -0.324644 0.430015 -0.523285 -0.573396 0.485428 -0.484937 -0.388509 0.935943 0.921663 -0.226750 -0.205626 -0.161657
srv_serror_rate -0.181817 -0.652557 -0.510009 0.022250 -0.060279 -0.005454 -0.082632 -0.018135 -0.472549 -0.060477 -0.018005 -0.007853 -0.041042 -0.024007 -0.011826 -0.030483 NaN -0.001818 -0.061101 0.541777 0.108170 0.973119 1.000000 -0.221613 -0.237100 -0.706146 0.624528 -0.305441 0.410975 -0.479041 -0.528008 0.438992 -0.470598 -0.368625 0.918676 0.942332 -0.275662 -0.259482 -0.137798
rerror_rate 0.051243 -0.363120 -0.299625 -0.003908 -0.018378 -0.003241 -0.009033 0.030823 -0.283275 -0.002037 -0.008922 -0.005938 -0.025009 -0.014897 -0.007407 -0.019168 NaN -0.001080 -0.035989 0.071425 -0.207601 -0.173769 -0.221613 1.000000 0.965777 -0.223062 0.234307 -0.149382 0.082848 -0.311755 -0.291812 0.287811 -0.014192 -0.078181 -0.179240 -0.223543 0.838928 0.883992 -0.150114
srv_rerror_rate 0.045693 -0.344039 -0.277687 -0.005409 -0.035871 -0.003245 0.012295 0.030205 -0.263681 0.027408 -0.009572 -0.006827 -0.025659 -0.015332 -0.007418 -0.008358 NaN -0.001082 -0.036208 0.065207 -0.207795 -0.179206 -0.237100 0.965777 1.000000 -0.209830 0.213447 -0.121466 0.075002 -0.297289 -0.275353 0.273444 -0.020199 -0.075414 -0.187131 -0.238481 0.830763 0.893986 -0.141205
same_srv_rate 0.168681 0.752801 0.629973 0.008607 0.047834 0.006498 0.106979 0.022804 0.596788 0.075803 0.027347 0.018973 0.049556 0.030867 0.010895 0.039703 NaN 0.002166 0.070011 -0.719825 0.031631 -0.755361 -0.706146 -0.223062 -0.209830 1.000000 -0.920431 0.384775 -0.540922 0.698873 0.757717 -0.650731 0.525177 0.488303 -0.716587 -0.678389 -0.144015 -0.157457 0.176583
diff_srv_rate -0.139716 -0.705392 -0.608104 -0.008086 -0.047221 -0.006571 -0.103591 -0.022191 -0.588050 -0.075177 -0.026764 -0.018748 -0.044460 -0.028754 -0.008736 -0.038935 NaN -0.002190 -0.065696 0.616829 -0.038901 0.674216 0.624528 0.234307 0.213447 -0.920431 1.000000 -0.376300 0.525512 -0.668239 -0.727460 0.646014 -0.442844 -0.481938 0.644672 0.596769 0.156418 0.159553 -0.210587
srv_diff_host_rate 0.015576 0.289111 0.310325 0.023460 -0.027775 -0.004512 -0.028497 -0.016621 0.265448 -0.013983 -0.016112 -0.013457 -0.026586 -0.008095 -0.007551 0.008526 NaN -0.001504 -0.048677 -0.325432 0.235030 -0.324644 -0.305441 -0.149382 -0.121466 0.384775 -0.376300 1.000000 -0.311145 0.396448 0.441664 -0.404185 0.159826 0.341972 -0.327256 -0.301231 -0.136421 -0.124974 0.138474
dst_host_count -0.066844 -0.405293 -0.342276 -0.026018 0.022517 -0.006790 -0.068595 -0.031247 -0.424506 -0.046542 -0.029030 -0.011683 -0.053603 -0.021418 -0.010266 -0.000255 NaN 0.002252 -0.034774 0.619602 0.218114 0.430015 0.410975 0.082848 0.075002 -0.540922 0.525512 -0.311145 1.000000 -0.350306 -0.531818 0.434858 -0.693475 -0.837959 0.422777 0.380286 0.052243 0.030199 -0.127165
dst_host_srv_count -0.156473 0.620720 0.707573 -0.017995 -0.018833 -0.011919 0.004939 -0.018930 0.642934 0.007197 -0.009064 -0.019863 -0.030721 -0.021221 -0.008106 0.011896 NaN -0.001381 -0.034702 -0.324673 0.299753 -0.523285 -0.479041 -0.311755 -0.297289 0.698873 -0.668239 0.396448 -0.350306 1.000000 0.919323 -0.840731 0.152125 0.447503 -0.528423 -0.469021 -0.265444 -0.248662 0.382018
dst_host_same_srv_count -0.140483 0.617294 0.666924 0.012719 -0.020442 -0.004589 0.063152 0.007478 0.606223 0.068608 0.011467 -0.009882 -0.021363 -0.010295 -0.004130 0.010824 NaN -0.001694 -0.020360 -0.429355 0.257425 -0.573396 -0.528008 -0.291812 -0.275353 0.757717 -0.727460 0.441664 -0.531818 0.919323 1.000000 -0.898954 0.305138 0.538861 -0.581725 -0.514917 -0.247405 -0.219657 0.270553
dst_host_diff_srv_count 0.196320 -0.525394 -0.624866 -0.011336 0.032054 0.002349 -0.060204 -0.002810 -0.555504 -0.067647 -0.010411 0.014021 0.039453 0.019298 0.003848 0.002270 NaN 0.004532 0.025117 0.361821 -0.240947 0.485428 0.438992 0.287811 0.273444 -0.650731 0.646014 -0.404185 0.434858 -0.840731 -0.898954 1.000000 -0.213435 -0.490264 0.504520 0.433630 0.267599 0.223628 -0.260194
dst_host_same_src_port_rate 0.182132 0.377040 0.059123 0.022817 0.109508 0.004120 -0.001804 0.015013 0.144428 -0.002362 0.013502 0.001039 0.060567 0.016066 0.021355 -0.014611 NaN -0.002612 -0.017738 -0.546478 -0.196667 -0.484937 -0.470598 -0.014192 -0.020199 0.525177 -0.442844 0.159826 -0.693475 0.152125 0.305138 -0.213435 1.000000 0.561120 -0.455296 -0.449602 0.039497 -0.007631 -0.123755
dst_host_srv_diff_host_rate -0.026560 0.343929 0.327014 0.021344 -0.036583 -0.000385 -0.066183 -0.018310 0.426757 -0.052963 0.010014 -0.006066 0.034999 0.001808 0.002260 -0.012264 NaN -0.001847 -0.060950 -0.531004 -0.122382 -0.388509 -0.368625 -0.078181 -0.075414 0.488303 -0.481938 0.341972 -0.837959 0.447503 0.538861 -0.490264 0.561120 1.000000 -0.384842 -0.340312 -0.062218 -0.037944 0.213489
dst_host_serror_rate -0.155824 -0.624111 -0.519159 0.020113 -0.016905 -0.004700 -0.063820 0.007159 -0.464252 -0.030676 -0.010312 0.003470 -0.032004 -0.001104 -0.006687 -0.023331 NaN 0.002546 -0.046317 0.536367 0.045705 0.935943 0.918676 -0.179240 -0.187131 -0.716587 0.644672 -0.327256 0.422777 -0.528423 -0.581725 0.504520 -0.455296 -0.384842 1.000000 0.919490 -0.195192 -0.206329 -0.175164
dst_host_srv_serror_rate -0.153078 -0.608681 -0.483431 0.017276 -0.063794 -0.005772 -0.070269 0.007902 -0.430866 -0.027964 -0.010236 0.004323 -0.023339 -0.006529 -0.007329 -0.023089 NaN 0.002615 -0.061750 0.505348 0.079506 0.921663 0.942332 -0.223543 -0.238481 -0.678389 0.596769 -0.301231 0.380286 -0.469021 -0.514917 0.433630 -0.449602 -0.340312 0.919490 1.000000 -0.271819 -0.250393 -0.116181
dst_host_rerror_rate 0.065207 -0.229499 -0.218704 -0.003990 0.038362 -0.003949 0.095387 0.029613 -0.196843 0.125343 -0.000550 0.006978 -0.010874 0.001445 -0.000450 -0.009838 NaN -0.001316 0.009867 0.021829 -0.213103 -0.226750 -0.275662 0.838928 0.830763 -0.144015 0.156418 -0.136421 0.052243 -0.265444 -0.247405 0.267599 0.039497 -0.062218 -0.195192 -0.271819 1.000000 0.880407 -0.190619
dst_host_srv_rerror_rate 0.068996 -0.259287 -0.186047 -0.005984 -0.039682 -0.003590 0.095205 0.029172 -0.175929 0.147736 0.002795 0.008884 -0.013006 0.001619 -0.007182 -0.007203 NaN 0.005779 -0.010718 0.023688 -0.219695 -0.205626 -0.259482 0.883992 0.893986 -0.157457 0.159553 -0.124974 0.030199 -0.248662 -0.219657 0.223628 -0.007631 -0.037944 -0.206329 -0.250393 0.880407 1.000000 -0.135072
last_flag -0.007767 0.279580 0.460242 -0.017332 -0.146584 -0.007988 -0.152387 -0.048949 0.491281 -0.152731 -0.030324 -0.023846 -0.001660 -0.016350 -0.004215 0.021126 NaN -0.004272 -0.059315 -0.152516 -0.052618 -0.161657 -0.137798 -0.150114 -0.141205 0.176583 -0.210587 0.138474 -0.127165 0.382018 0.270553 -0.260194 -0.123755 0.213489 -0.175164 -0.116181 -0.190619 -0.135072 1.000000
In [76]:
# Generate a mask for the upper triangle
#mask = np.zeros_like(df_corr, dtype=np.bool)
#mask[np.triu_indices_from(mask)] = True

sns.heatmap(df_corr, cmap='rainbow', annot=False, fmt=".2f", center=0, square=False, linewidths=.75,
            #mask=mask,
           );
plt.title('Correlation Matrix', fontsize=18)
plt.show()
In [77]:
getHighlyCorrelatedColumns(dataset_network_train, 10)
Out[77]:
level_0 level_1 0
0 num_compromised num_root 0.998833
1 serror_rate srv_serror_rate 0.993289
2 rerror_rate srv_rerror_rate 0.989008
3 srv_serror_rate dst_host_srv_serror_rate 0.986252
4 dst_host_serror_rate dst_host_srv_serror_rate 0.985052
5 serror_rate dst_host_srv_serror_rate 0.981139
6 serror_rate dst_host_serror_rate 0.979373
7 srv_serror_rate dst_host_serror_rate 0.977596
8 srv_rerror_rate dst_host_srv_rerror_rate 0.970208
9 rerror_rate dst_host_srv_rerror_rate 0.964449
In [78]:
upper = df_corr.where(np.triu(np.ones(df_corr.shape), k=1).astype(np.bool))

to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]

print(f'Columns to drop from Train and Test datasets are {to_drop}.')

dataset_network_train.drop(columns=to_drop, axis=1, inplace=True)
dataset_network_test.drop(columns=to_drop, axis=1, inplace=True)

(dataset_network_train.shape, dataset_network_test.shape)
Columns to drop from Train and Test datasets are ['srv_serror_rate', 'srv_rerror_rate', 'dst_host_same_srv_count', 'dst_host_serror_rate', 'dst_host_srv_serror_rate'].
Out[78]:
((125973, 38), (22544, 38))

Understanding Target Variable by Classes

In [79]:
dataset_network_train['attack_type_twoclass'] = dataset_network_train['attack_type'].apply(attackTypeConverter)
dataset_network_test['attack_type_twoclass'] = dataset_network_test['attack_type'].apply(attackTypeConverter)

dataset_network_train['attack_type_twoclass_num'] = dataset_network_train['attack_type'].apply(attackTypeNumConverter)
dataset_network_test['attack_type_twoclass_num'] = dataset_network_test['attack_type'].apply(attackTypeNumConverter)
In [80]:
plt.subplot(221)
plot_countplot(dataset_network_train.attack_type_twoclass, 'Attack Type - Two Class (train feature) Distribution', 'Attack Type(s)')
plt.subplot(222)
plot_valuecount_pieplot(dataset_network_train.attack_type_twoclass, 'Attack Type - Two Class (train feature) Distribution Percentage')
plt.subplot(223)
plot_countplot(dataset_network_test.attack_type_twoclass, 'Attack Type - Two Class (test feature) Distribution', 'Attack Type(s)')
plt.subplot(224)
plot_valuecount_pieplot(dataset_network_test.attack_type_twoclass, 'Attack Type - Two Class (test feature) Distribution Percentage')
plt.tight_layout()
plt.show()
<Figure size 1080x864 with 0 Axes>

Key Observations

  • When dataset distributed in two attack types ("Normal", "Anomaly") then the dataset is nearly balanced
In [81]:
dataset_network_train['attack_type_fiveclass'] = dataset_network_train['attack_type'].apply(attackTypeMultiConverter)
dataset_network_test['attack_type_fiveclass'] = dataset_network_test['attack_type'].apply(attackTypeMultiConverter)

dataset_network_train['attack_type_fiveclass_num'] = dataset_network_train['attack_type'].apply(attackTypeMultiNumConverter)
dataset_network_test['attack_type_fiveclass_num'] = dataset_network_test['attack_type'].apply(attackTypeMultiNumConverter)
In [82]:
plt.subplot(221)
plot_countplot(dataset_network_train.attack_type_fiveclass, 'Attack Type - Five Class (train feature) Distribution', 'Attack Type(s)')
plt.subplot(222)
plot_valuecount_pieplot(dataset_network_train.attack_type_fiveclass, 'Attack Type - Five Class (train feature) Distribution Percentage')
plt.subplot(223)
plot_countplot(dataset_network_test.attack_type_fiveclass, 'Attack Type - Five Class (test feature) Distribution', 'Attack Type(s)')
plt.subplot(224)
plot_valuecount_pieplot(dataset_network_test.attack_type_fiveclass, 'Attack Type - Five Class (test feature) Distribution Percentage')
plt.tight_layout()
plt.show()
<Figure size 1080x864 with 0 Axes>

Key Observations

  • When dataset distributed in five attack types ("Normal", "DoS", "Probe", "R2L", "U2R") then the dataset is unbalanced.
  • 88.6% dataset is distributed between attack types - "Normal", and "DoS".
In [83]:
(dataset_network_train.shape, dataset_network_test.shape)
Out[83]:
((125973, 42), (22544, 42))

Perform exploratory data analysis (EDA) by attack_type (five classes)

In [84]:
sns.lmplot(x='dst_host_same_src_port_rate', y='dst_host_srv_diff_host_rate', hue='attack_type_fiveclass', 
           data=dataset_network_train, size=11
          );
plt.title('"dst_host_same_src_port_rate" vs "dst_host_srv_diff_host_rate"', fontsize=18)
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\regression.py:546: UserWarning:

The `size` paramter has been renamed to `height`; please update your code.

Key Observations

  • dst_host_same_src_port_rate has slight effect on the intrusion type.
  • for "dst_host_same_src_port_rate" value greater than equal to 1 it can be "probe" and "r2l""
In [85]:
sns.lmplot(x='duration', y='src_bytes', hue='attack_type_fiveclass', data=dataset_network_train, size=11);
plt.title('"duration" vs "src_bytes"', fontsize=18)
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\regression.py:546: UserWarning:

The `size` paramter has been renamed to `height`; please update your code.

Key Observations

  • For duration Greater than 30000 we can see it's 'probe', therefore duration itself is a strong predictor.
In [86]:
sns.lmplot(x='dst_host_count', y='serror_rate', hue='attack_type_fiveclass', data=dataset_network_train, size=11);
plt.title('"dst_host_count" vs "serror_rate"', fontsize=18)
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\regression.py:546: UserWarning:

The `size` paramter has been renamed to `height`; please update your code.

In [87]:
sns.lmplot(x='count', y='serror_rate', hue='attack_type_fiveclass', data=dataset_network_train, size=11);
plt.title('"count" vs "serror_rate"', fontsize=18)
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\regression.py:546: UserWarning:

The `size` paramter has been renamed to `height`; please update your code.

In [88]:
sns.pointplot(x='flag', y='land', hue='attack_type_fiveclass', data=dataset_network_train, join=False);
plt.title('"flag" vs "land"', fontsize=18)
plt.show()

Key Observations

  • No such clear identification
In [89]:
mosaic(dataset_network_train, ['service', 'protocol_type']);
plt.title('"service" vs "protocol_type"', fontsize=18)
plt.show()
In [90]:
mosaic(dataset_network_train, ['service', 'flag']);
plt.title('"service" vs "flag"', fontsize=18)
plt.show()

Handling Nominal Features

In [91]:
unique_protocol_type = pd.concat([dataset_network_train.protocol_type,
                                 dataset_network_test.protocol_type],
                                 ignore_index=True).unique().ravel()
unique_service = pd.concat([dataset_network_train.service,
                                 dataset_network_test.service],
                                 ignore_index=True).unique().ravel()
unique_flag = pd.concat([dataset_network_train.flag,
                                 dataset_network_test.flag],
                                 ignore_index=True).unique().ravel()
print(unique_protocol_type)
print(unique_service)
print(unique_flag)
['tcp' 'udp' 'icmp']
['ftp_data' 'other' 'private' 'http' 'remote_job' 'name' 'netbios_ns'
 'eco_i' 'mtp' 'telnet' 'finger' 'domain_u' 'supdup' 'uucp_path' 'Z39_50'
 'smtp' 'csnet_ns' 'uucp' 'netbios_dgm' 'urp_i' 'auth' 'domain' 'ftp'
 'bgp' 'ldap' 'ecr_i' 'gopher' 'vmnet' 'systat' 'http_443' 'efs' 'whois'
 'imap4' 'iso_tsap' 'echo' 'klogin' 'link' 'sunrpc' 'login' 'kshell'
 'sql_net' 'time' 'hostnames' 'exec' 'ntp_u' 'discard' 'nntp' 'courier'
 'ctf' 'ssh' 'daytime' 'shell' 'netstat' 'pop_3' 'nnsp' 'IRC' 'pop_2'
 'printer' 'tim_i' 'pm_dump' 'red_i' 'netbios_ssn' 'rje' 'X11' 'urh_i'
 'http_8001' 'aol' 'http_2784' 'tftp_u' 'harvest']
['SF' 'S0' 'REJ' 'RSTR' 'SH' 'RSTO' 'S1' 'RSTOS0' 'S3' 'S2' 'OTH']
In [92]:
def protocolTypeNumConverter(protocol_type):
    if protocol_type == 'tcp' :
        return 0
    elif protocol_type == 'udp' :
        return 1
    elif protocol_type == 'icmp' :
        return 2
    
def flagNumConverter(flag):
    if flag == 'SF' :
        return 0
    elif flag == 'S0' :
        return 1
    elif flag == 'REJ' :
        return 2
    elif flag == 'RSTR' :
        return 3
    elif flag == 'SH' :
        return 4
    elif flag == 'RSTO' :
        return 5
    elif flag == 'S1' :
        return 6
    elif flag == 'RSTOS0' :
        return 7
    elif flag == 'S3' :
        return 8
    elif flag == 'S2' :
        return 9
    elif flag == 'OTH' :
        return 10
In [93]:
serviceNumConverter = {
    'ftp_data': 0,
    'other': 1,
    'private': 2,
    'http': 3,
    'remote_job': 4,
    'name': 5,
    'netbios_ns': 6,
    'eco_i': 7,
    'mtp': 8,
    'telnet': 9,
    'finger': 10,
    'domain_u': 11,
    'supdup': 12,
    'uucp_path': 13,
    'Z39_50': 14,
    'smtp': 15,
    'csnet_ns': 16,
    'uucp': 17,
    'netbios_dgm': 18,
    'urp_i': 19,
    'auth': 20,
    'domain': 21,
    'ftp': 22,
    'bgp': 23,
    'ldap': 24,
    'ecr_i': 25,
    'gopher': 26,
    'vmnet': 27,
    'systat': 28,
    'http_443': 29,
    'efs': 30,
    'whois': 31,
    'imap4': 32,
    'iso_tsap': 33,
    'echo': 34,
    'klogin': 35,
    'link': 36,
    'sunrpc': 37,
    'login': 38,
    'kshell': 39,
    'sql_net': 40,
    'time': 41,
    'hostnames': 42,
    'exec': 43,
    'ntp_u': 44,
    'discard': 45,
    'nntp': 46,
    'courier': 47,
    'ctf': 48,
    'ssh': 49,
    'daytime': 50,
    'shell': 51,
    'netstat': 52,
    'pop_3': 53,
    'nnsp': 54,
    'IRC': 55,
    'pop_2': 56,
    'printer': 57,
    'tim_i': 58,
    'pm_dump': 59,
    'red_i': 60,
    'netbios_ssn': 61,
    'rje': 62,
    'X11': 63,
    'urh_i': 64,
    'http_8001': 65,
    'aol': 66,
    'http_2784': 67,
    'tftp_u': 68,
    'harvest': 69,
    }
In [94]:
dataset_network_train['protocol_type'] = dataset_network_train['protocol_type'].apply(protocolTypeNumConverter)
dataset_network_test['protocol_type'] = dataset_network_test['protocol_type'].apply(protocolTypeNumConverter)
In [95]:
dataset_network_train.flag = dataset_network_train.flag.apply(flagNumConverter)
dataset_network_test.flag = dataset_network_test.flag.apply(flagNumConverter)
In [96]:
dataset_network_train.service = [serviceNumConverter[item] for item in dataset_network_train.service]
dataset_network_test.service = [serviceNumConverter[item] for item in dataset_network_test.service]

# Using map() function
# dataset_network_train['service'] = dataset_network_train['service'].map(serviceNumConverter).astype(int)
# dataset_network_test['service'] = dataset_network_test['service'].map(serviceNumConverter).astype(int)

Seperate data and target features

In [97]:
X_train=dataset_network_train.drop(['attack_type', 
                            'attack_type_twoclass','attack_type_twoclass_num', 
                            'attack_type_fiveclass','attack_type_fiveclass_num'
                           ], axis=1
                          )
y_train=dataset_network_train['attack_type_twoclass_num']
z_train=dataset_network_train['attack_type_fiveclass_num']

X_test=dataset_network_test.drop(['attack_type', 
                            'attack_type_twoclass', 'attack_type_twoclass_num', 
                            'attack_type_fiveclass', 'attack_type_fiveclass_num'
                           ], axis=1
                          )
y_test=dataset_network_test['attack_type_twoclass_num']
z_test=dataset_network_test['attack_type_fiveclass_num']

Remove Columns with Zero Standard

In [98]:
columnsWithZeroStdToRemove = getZeroStdColumns(X_train)
print(f'Columns with Zero STD to drop from Train and Test dataset(s) are {columnsWithZeroStdToRemove}.')
X_train.drop(columnsWithZeroStdToRemove, axis=1, inplace=True)
X_test.drop(columnsWithZeroStdToRemove, axis=1, inplace=True)
Columns with Zero STD to drop from Train and Test dataset(s) are ['num_outbound_cmds'].

Examin the shape of data and target variables after feature engineering

In [99]:
(X_train.shape, y_train.shape, z_train.shape)
Out[99]:
((125973, 36), (125973,), (125973,))
In [100]:
(X_test.shape, y_test.shape, z_test.shape)
Out[100]:
((22544, 36), (22544,), (22544,))

The Elbow Method Plot - understand the possible no(s) of cluster(s) of dataset

In [101]:
onetoten = [i for i in range(1,11)]
wcss = []
for i in onetoten:
    kmeans = KMeans(n_clusters = i, init = 'k-means++')
    kmeans.fit_predict(X_train)
    wcss.append(kmeans.inertia_)

# The Elbow Method Plot using Plotly
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# data = [go.Scatter(x = onetoten, y = wcss, mode='lines+markers',
#                    marker = dict(symbol = 'circle',),
#                   )
#        ]
# layout = getPlotlyLayout('KMeans - The Elbow Method','Number of clusters','WCSS')
# figure = dict(data = data, layout = layout)
# py.iplot(figure)

# The Elbow Method Plot using matplotlib
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
plt.plot(range(1, 11), wcss, marker='o')
plt.title('KMeans - The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Key Observations

  • Can see from the above graph that dataset is distributed in 2 cluster minimum.
  • Can see from the above graph that dataset is distributed in 5 cluster maximun.
  • Can see that after 5 possible clusters, the line is flat so dataset can't divied futher into clusters. Thus dataset can have minimum possible 2 clusters and maximum possible 5 clusters.
In [102]:
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.fit_transform(X_test)

# X_train = pd.DataFrame(X_train_scaled)
# X_test = pd.DataFrame(X_test_scaled)

Key Observations

  • Applied "StandardScaler" as well as "MinMaxScaler" but overall prediction accuracy doesn't have any effect and remains same. So continue to use normal data rather than using scaled dataset.

Modelling

In [103]:
n_cpus_avaliable = multiprocessing.cpu_count()

print(f'We\'ve got {n_cpus_avaliable} cpus to work with.')
We've got 4 cpus to work with.

Binomial Classification: Activity is normal or attack

Binomial Classification: Using NN Keras Classifier

Perform Classification

In [104]:
#earlyStopping = EarlyStopping(monitor='val_binary_accuracy', patience=1, verbose=0, mode='max')

kerasnn_model = Sequential()
kerasnn_model.add(Dense(32, input_shape=(X_train.shape[1], ), activation='relu',
          kernel_regularizer=regularizers.l2(0.01),
          ))
kerasnn_model.add(Dropout(0.25))
kerasnn_model.add(BatchNormalization())
kerasnn_model.add(Dense(64, activation='relu',
          kernel_regularizer=regularizers.l2(0.01),
          ))
kerasnn_model.add(GaussianNoise(0.1))
kerasnn_model.add(Dense(128, activation='relu',
          kernel_regularizer=regularizers.l2(0.01),
          ))
kerasnn_model.add(Dropout(0.25))
kerasnn_model.add(BatchNormalization())
kerasnn_model.add(Dense(128, activation='relu',
          kernel_regularizer=regularizers.l2(0.01),
          ))
kerasnn_model.add(GaussianNoise(0.1))
kerasnn_model.add(Dense(64, activation='relu',
          kernel_regularizer=regularizers.l2(0.01),
          ))
kerasnn_model.add(Dropout(0.25))
kerasnn_model.add(BatchNormalization())
kerasnn_model.add(Dense(32, activation='relu',
          kernel_regularizer=regularizers.l2(0.01),
          ))
kerasnn_model.add(Dense(1, activation='sigmoid'))
kerasnn_model.summary()

kerasnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=['binary_accuracy'])
#kerasnn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=9, callbacks=[earlyStopping], shuffle=True, verbose=1)
kerasnn_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, shuffle=True, verbose=0)
kerasnn_model.evaluate(X_test, y_test)
WARNING:tensorflow:From C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\ProgramData\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 32)                1184      
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 32)                128       
_________________________________________________________________
dense_2 (Dense)              (None, 64)                2112      
_________________________________________________________________
gaussian_noise_1 (GaussianNo (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 128)               512       
_________________________________________________________________
dense_4 (Dense)              (None, 128)               16512     
_________________________________________________________________
gaussian_noise_2 (GaussianNo (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_3 (Dropout)          (None, 64)                0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 64)                256       
_________________________________________________________________
dense_6 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 33        
=================================================================
Total params: 39,393
Trainable params: 38,945
Non-trainable params: 448
_________________________________________________________________
WARNING:tensorflow:From C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
22544/22544 [==============================] - 1s 37us/step
Out[104]:
[0.8344427692374073, 0.7017831795599716]
In [105]:
# Predicting the Test set results
y_pred = kerasnn_model.predict_classes(X_test)
y_pred_proba = kerasnn_model.predict_proba(X_test)

ac = accuracy_score(y_test, y_pred)
print('The accuracy score of the Keras NN (Two Class) model is: {}%'.format(ac * 100))
print('\n')

cr = classification_report(y_test, y_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(y_test, y_pred, title='Binomial Classification (Keras NN Confusion Matrix)',
                                    x_tick_rotation=90, 
                                    cmap='Oranges',
                                   );
The accuracy score of the Keras NN (Two Class) model is: 70.17831795599716%


              precision    recall  f1-score   support

           0       0.60      0.94      0.73      9711
           1       0.92      0.52      0.67     12833

   micro avg       0.70      0.70      0.70     22544
   macro avg       0.76      0.73      0.70     22544
weighted avg       0.78      0.70      0.69     22544



Binomial Classification: Using LightGBM Classifier

Perform Classification

In [106]:
def lgbm_status_print_twoclass(optimal_result):
    all_models = pd.DataFrame(lgbm_bayes_cv_tuner_twoclass.cv_results_)
    best_params = pd.Series(lgbm_bayes_cv_tuner_twoclass.best_params_)
    print('Model #{}\nBest ROC-AUC: {}\nBest params: {}\n'.format(len(all_models),
            np.round(lgbm_bayes_cv_tuner_twoclass.best_score_, 4),
            lgbm_bayes_cv_tuner_twoclass.best_params_)
         )


lgbm_bayes_cv_tuner_twoclass = BayesSearchCV(
    estimator=lgbm.LGBMClassifier(n_jobs=n_cpus_avaliable,
                                  objective='binary',
                                  metric='binary_logloss',
                                  class_weight='balanced',
                                  silent=True),
    search_spaces={
        'boosting_type': ['gbdt', 'dart', 'rf'],
        'num_leaves': (1, 50),
        'max_depth': (1, 25),
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'n_estimators': (100, 300),
        'min_split_gain': (0.01, 1.0, 'uniform'),
        'min_child_weight': (0.01, 1.0, 'uniform'),
        'min_child_samples': (1, 50),
        'subsample': (0.01, 1.0, 'uniform'),
        'subsample_freq': (1, 50),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'bagging_fraction': (0.01, 1.0, 'uniform'),
        'feature_fraction': (0.01, 1.0, 'uniform'),
        },
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=9, shuffle=True),
    n_jobs=n_cpus_avaliable,
    n_iter=5,
    refit=True,
    verbose=0,
    )


lgbm_result_twoclass = lgbm_bayes_cv_tuner_twoclass.fit(X_train, y_train,
        callback=lgbm_status_print_twoclass)

lgbm_twoclass_model = lgbm_result_twoclass.best_estimator_

print(lgbm_twoclass_model)
Model #1
Best ROC-AUC: 1.0
Best params: {'bagging_fraction': 0.6763361685130609, 'boosting_type': 'gbdt', 'colsample_bytree': 0.11688537090149818, 'feature_fraction': 0.9092296667254578, 'learning_rate': 0.02705915672082682, 'max_depth': 19, 'min_child_samples': 49, 'min_child_weight': 0.3909932699621439, 'min_split_gain': 0.7838364334990557, 'n_estimators': 210, 'num_leaves': 40, 'reg_lambda': 0.1376853048221502, 'subsample': 0.4881428682426217, 'subsample_freq': 7}

Model #2
Best ROC-AUC: 1.0
Best params: {'bagging_fraction': 0.6763361685130609, 'boosting_type': 'gbdt', 'colsample_bytree': 0.11688537090149818, 'feature_fraction': 0.9092296667254578, 'learning_rate': 0.02705915672082682, 'max_depth': 19, 'min_child_samples': 49, 'min_child_weight': 0.3909932699621439, 'min_split_gain': 0.7838364334990557, 'n_estimators': 210, 'num_leaves': 40, 'reg_lambda': 0.1376853048221502, 'subsample': 0.4881428682426217, 'subsample_freq': 7}

Model #3
Best ROC-AUC: 1.0
Best params: {'bagging_fraction': 0.6763361685130609, 'boosting_type': 'gbdt', 'colsample_bytree': 0.11688537090149818, 'feature_fraction': 0.9092296667254578, 'learning_rate': 0.02705915672082682, 'max_depth': 19, 'min_child_samples': 49, 'min_child_weight': 0.3909932699621439, 'min_split_gain': 0.7838364334990557, 'n_estimators': 210, 'num_leaves': 40, 'reg_lambda': 0.1376853048221502, 'subsample': 0.4881428682426217, 'subsample_freq': 7}

Model #4
Best ROC-AUC: 1.0
Best params: {'bagging_fraction': 0.6763361685130609, 'boosting_type': 'gbdt', 'colsample_bytree': 0.11688537090149818, 'feature_fraction': 0.9092296667254578, 'learning_rate': 0.02705915672082682, 'max_depth': 19, 'min_child_samples': 49, 'min_child_weight': 0.3909932699621439, 'min_split_gain': 0.7838364334990557, 'n_estimators': 210, 'num_leaves': 40, 'reg_lambda': 0.1376853048221502, 'subsample': 0.4881428682426217, 'subsample_freq': 7}

Model #5
Best ROC-AUC: 1.0
Best params: {'bagging_fraction': 0.6763361685130609, 'boosting_type': 'gbdt', 'colsample_bytree': 0.11688537090149818, 'feature_fraction': 0.9092296667254578, 'learning_rate': 0.02705915672082682, 'max_depth': 19, 'min_child_samples': 49, 'min_child_weight': 0.3909932699621439, 'min_split_gain': 0.7838364334990557, 'n_estimators': 210, 'num_leaves': 40, 'reg_lambda': 0.1376853048221502, 'subsample': 0.4881428682426217, 'subsample_freq': 7}

LGBMClassifier(bagging_fraction=0.6763361685130609, boosting_type='gbdt',
        class_weight='balanced', colsample_bytree=0.11688537090149818,
        feature_fraction=0.9092296667254578, importance_type='split',
        learning_rate=0.02705915672082682, max_depth=19,
        metric='binary_logloss', min_child_samples=49,
        min_child_weight=0.3909932699621439,
        min_split_gain=0.7838364334990557, n_estimators=210, n_jobs=4,
        num_leaves=40, objective='binary', random_state=None,
        reg_alpha=0.0, reg_lambda=0.1376853048221502, silent=True,
        subsample=0.4881428682426217, subsample_for_bin=200000,
        subsample_freq=7)
In [107]:
# Predicting the Test set results
y_pred = lgbm_twoclass_model.predict(X_test)
y_pred_proba = lgbm_twoclass_model.predict_proba(X_test)

ac = accuracy_score(y_test, y_pred)
print('The accuracy score of the LightGBM (Two Class) model is: {}%'.format(ac * 100))
print('\n')

cr = classification_report(y_test, y_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(y_test, y_pred, title='Binomial Classification (LightGBM Confusion Matrix)',
                                    x_tick_rotation=90, 
                                    cmap='Oranges',
                                   );
print('\n')

skplt.metrics.plot_precision_recall(y_test, y_pred_proba,
                                    title='Binomial Classification (LightGBM Precision-Recall Curve)',
                                   );
print('\n')

skplt.metrics.plot_roc(y_test, y_pred_proba,
                       title='Binomial Classification (LightGBM ROC Curves)',
                      );
The accuracy score of the LightGBM (Two Class) model is: 81.65809084457062%


              precision    recall  f1-score   support

           0       0.71      0.97      0.82      9711
           1       0.97      0.70      0.81     12833

   micro avg       0.82      0.82      0.82     22544
   macro avg       0.84      0.84      0.82     22544
weighted avg       0.86      0.82      0.82     22544







In [108]:
feature_importance = pd.DataFrame({'imp': lgbm_twoclass_model.feature_importances_, 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor);
plt.title('Binomial Classification (LightGBM - Feature Importance(s))', fontsize=18)
plt.show()

Binomial Classification: Using LightGBM Classifier (Hyperopt)

Perform Classification

In [109]:
def lgbmc_objective(params):
    params = {
        'num_leaves': int(params['num_leaves']),
        'max_depth': int(params['max_depth']),
        'n_estimators': int(params['n_estimators']),
        'min_child_samples': int(params['min_child_samples']),
        'subsample_freq': int(params['min_child_samples']),
        }
    lgbmClassifier = lgbm.LGBMClassifier(
        n_jobs=n_cpus_avaliable,
        objective='binary',
        metric='binary_logloss',
        class_weight='balanced',
        silent=True,
        **params
        )

    # if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
    lgbmcScore = cross_val_score(lgbmClassifier, X_train, y_train,
                                 cv=StratifiedKFold(n_splits=9,
                                 shuffle=True),
                                 n_jobs=n_cpus_avaliable).mean()

    # print('LGBM-Classifier Score: {:.4f}, Parameters are: {}'.format(lgbmcScore, params))

    return lgbmcScore


lgbmc_space = {
    'num_leaves': hp.quniform('num_leaves', 1, 50, 1),
    'max_depth': hp.quniform('max_depth', 1, 25, 1),
    'learning_rate': hp.qloguniform('learning_rate', 0.01, 1.0, 0.01),
    'n_estimators': hp.quniform('n_estimators', 100, 300, 2),
    'min_split_gain': hp.quniform('min_split_gain', 0.01, 1.0, 0.01),
    'min_child_weight': hp.quniform('min_child_weight', 0.01, 1.0,
                                    0.01),
    'min_child_samples': hp.quniform('min_child_samples', 1, 50, 1),
    'subsample': hp.quniform('subsample', 0.01, 1.0, 0.01),
    'subsample_freq': hp.quniform('subsample_freq', 1, 50, 1),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.01, 1.0,
                                    0.01),
    'reg_lambda': hp.qloguniform('reg_lambda', 1e-5, 1000, 0.01),
    'bagging_fraction': hp.quniform('bagging_fraction', 0.01, 1.0,
                                    0.01),
    'feature_fraction': hp.quniform('feature_fraction', 0.01, 1.0,
                                    0.01),
    }

trials = Trials()
lgbmc_best = fmin(fn=lgbmc_objective, space=lgbmc_space,
                  algo=tpe.suggest, max_evals=5, trials=trials)
print(trials.losses())
print('\n')
print(lgbmc_best)
100%|█████████████████████████████████████████████████████| 5/5 [01:47<00:00, 20.02s/it, best loss: 0.9992934903231594]
[0.9994284432793084, 0.999706281729223, 0.9996507193702173, 0.9992934903231594, 0.9996427834298629]


{'bagging_fraction': 0.11, 'colsample_bytree': 0.37, 'feature_fraction': 0.33, 'learning_rate': 2.13, 'max_depth': 4.0, 'min_child_samples': 45.0, 'min_child_weight': 0.08, 'min_split_gain': 0.34, 'n_estimators': 152.0, 'num_leaves': 9.0, 'reg_lambda': inf, 'subsample': 0.02, 'subsample_freq': 34.0}
In [110]:
space_eval(lgbmc_space, lgbmc_best)
Out[110]:
{'bagging_fraction': 0.11,
 'colsample_bytree': 0.37,
 'feature_fraction': 0.33,
 'learning_rate': 2.13,
 'max_depth': 4.0,
 'min_child_samples': 45.0,
 'min_child_weight': 0.08,
 'min_split_gain': 0.34,
 'n_estimators': 152.0,
 'num_leaves': 9.0,
 'reg_lambda': inf,
 'subsample': 0.02,
 'subsample_freq': 34.0}
In [111]:
lgbmc_best=convertIntFloatToInt(lgbmc_best)
lgbmc_best
Out[111]:
{'bagging_fraction': 0.11,
 'colsample_bytree': 0.37,
 'feature_fraction': 0.33,
 'learning_rate': 2.13,
 'max_depth': 4,
 'min_child_samples': 45,
 'min_child_weight': 0.08,
 'min_split_gain': 0.34,
 'n_estimators': 152,
 'num_leaves': 9,
 'reg_lambda': inf,
 'subsample': 0.02,
 'subsample_freq': 34}
In [112]:
lgbmClassifierHyperopt = lgbm.LGBMClassifier(
        n_jobs=n_cpus_avaliable,
        objective='binary',
        metric='binary_logloss',
        class_weight='balanced',
        silent=True,
        **lgbmc_best
        )
lgbmClassifierHyperopt.fit(X_train, y_train)

# Predicting the Test set results
y_pred = lgbmClassifierHyperopt.predict(X_test)
y_pred_proba = lgbmClassifierHyperopt.predict_proba(X_test)

ac = accuracy_score(y_test, y_pred)
print('The accuracy score of the LightGBM Hyperopt (Two Class) model is: {}%'.format(ac * 100))
print('\n')

cr = classification_report(y_test, y_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(y_test, y_pred, title='Binomial Classification (LightGBM Hyperopt Confusion Matrix)',
                                    x_tick_rotation=90, 
                                    cmap='Oranges',
                                   );
print('\n')

skplt.metrics.plot_precision_recall(y_test, y_pred_proba,
                                    title='Binomial Classification (LightGBM Hyperopt Precision-Recall Curve)',
                                   );
print('\n')

skplt.metrics.plot_roc(y_test, y_pred_proba,
                       title='Binomial Classification (LightGBM Hyperopt ROC Curves)',
                      );
The accuracy score of the LightGBM Hyperopt (Two Class) model is: 43.07576295244854%


C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

              precision    recall  f1-score   support

           0       0.43      1.00      0.60      9711
           1       0.00      0.00      0.00     12833

   micro avg       0.43      0.43      0.43     22544
   macro avg       0.22      0.50      0.30     22544
weighted avg       0.19      0.43      0.26     22544







In [113]:
feature_importance = pd.DataFrame({'imp': lgbmClassifierHyperopt.feature_importances_, 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor);
plt.title('Binomial Classification (LightGBM - Feature Importance(s))', fontsize=18)
plt.show()

Binomial Classification: Using XGBoost Classifier

Perform Classification

In [114]:
def status_print_twoclass(optimal_result):
    all_models = pd.DataFrame(bayes_cv_tuner_twoclass.cv_results_)
    best_params = pd.Series(bayes_cv_tuner_twoclass.best_params_)
    print('Model #{}\nBest ROC-AUC: {}\nBest params: {}\n'.format(len(all_models), 
            np.round(bayes_cv_tuner_twoclass.best_score_, 4), 
            bayes_cv_tuner_twoclass.best_params_)
         )


bayes_cv_tuner_twoclass = BayesSearchCV(
    estimator=xgb.XGBClassifier(
        n_jobs=n_cpus_avaliable,
        objective='binary:logistic',
        eval_metric='auc',
        silent=1,
        tree_method='approx',
        device='cpu',
        ),
    search_spaces={
        'booster': ['gbtree', 'dart'],
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'max_delta_step': (0, 20),
        'max_depth': (0, 25),
        'min_child_weight': (0, 10),
        'n_estimators': (100, 300),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'scale_pos_weight': (1e-6, 500, 'log-uniform'),
        },
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=9, shuffle=True),
    n_jobs=n_cpus_avaliable,
    n_iter=7,
    refit=True,
    verbose=0,
    )

result_twoclass = bayes_cv_tuner_twoclass.fit(X_train, y_train,
        callback=status_print_twoclass)

xgb_twoclass_model = result_twoclass.best_estimator_

print(xgb_twoclass_model)
Model #1
Best ROC-AUC: 0.987
Best params: {'booster': 'dart', 'colsample_bylevel': 0.5168026020527496, 'colsample_bytree': 0.5512855490191089, 'gamma': 0.23498516210735837, 'learning_rate': 0.029026949949969905, 'max_delta_step': 12, 'max_depth': 1, 'min_child_weight': 2, 'n_estimators': 116, 'reg_lambda': 0.009473812440859838, 'scale_pos_weight': 4.821037947653469, 'subsample': 0.041253633514046546}

Model #2
Best ROC-AUC: 1.0
Best params: {'booster': 'gbtree', 'colsample_bylevel': 0.7742671231359579, 'colsample_bytree': 0.641156321702667, 'gamma': 0.027455058133773455, 'learning_rate': 0.13207337025323213, 'max_delta_step': 19, 'max_depth': 13, 'min_child_weight': 2, 'n_estimators': 189, 'reg_lambda': 3.016771638452131e-06, 'scale_pos_weight': 257.4257611046447, 'subsample': 0.1883595592814563}

Model #3
Best ROC-AUC: 1.0
Best params: {'booster': 'gbtree', 'colsample_bylevel': 0.7742671231359579, 'colsample_bytree': 0.641156321702667, 'gamma': 0.027455058133773455, 'learning_rate': 0.13207337025323213, 'max_delta_step': 19, 'max_depth': 13, 'min_child_weight': 2, 'n_estimators': 189, 'reg_lambda': 3.016771638452131e-06, 'scale_pos_weight': 257.4257611046447, 'subsample': 0.1883595592814563}

Model #4
Best ROC-AUC: 1.0
Best params: {'booster': 'gbtree', 'colsample_bylevel': 0.7742671231359579, 'colsample_bytree': 0.641156321702667, 'gamma': 0.027455058133773455, 'learning_rate': 0.13207337025323213, 'max_delta_step': 19, 'max_depth': 13, 'min_child_weight': 2, 'n_estimators': 189, 'reg_lambda': 3.016771638452131e-06, 'scale_pos_weight': 257.4257611046447, 'subsample': 0.1883595592814563}

Model #5
Best ROC-AUC: 1.0
Best params: {'booster': 'gbtree', 'colsample_bylevel': 0.7742671231359579, 'colsample_bytree': 0.641156321702667, 'gamma': 0.027455058133773455, 'learning_rate': 0.13207337025323213, 'max_delta_step': 19, 'max_depth': 13, 'min_child_weight': 2, 'n_estimators': 189, 'reg_lambda': 3.016771638452131e-06, 'scale_pos_weight': 257.4257611046447, 'subsample': 0.1883595592814563}

Model #6
Best ROC-AUC: 1.0
Best params: {'booster': 'gbtree', 'colsample_bylevel': 0.7742671231359579, 'colsample_bytree': 0.641156321702667, 'gamma': 0.027455058133773455, 'learning_rate': 0.13207337025323213, 'max_delta_step': 19, 'max_depth': 13, 'min_child_weight': 2, 'n_estimators': 189, 'reg_lambda': 3.016771638452131e-06, 'scale_pos_weight': 257.4257611046447, 'subsample': 0.1883595592814563}

Model #7
Best ROC-AUC: 1.0
Best params: {'booster': 'gbtree', 'colsample_bylevel': 0.7742671231359579, 'colsample_bytree': 0.641156321702667, 'gamma': 0.027455058133773455, 'learning_rate': 0.13207337025323213, 'max_delta_step': 19, 'max_depth': 13, 'min_child_weight': 2, 'n_estimators': 189, 'reg_lambda': 3.016771638452131e-06, 'scale_pos_weight': 257.4257611046447, 'subsample': 0.1883595592814563}

[20:44:04] Tree method is selected to be 'approx'
XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.7742671231359579,
       colsample_bytree=0.641156321702667, device='cpu', eval_metric='auc',
       gamma=0.027455058133773455, learning_rate=0.13207337025323213,
       max_delta_step=19, max_depth=13, min_child_weight=2, missing=None,
       n_estimators=189, n_jobs=4, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=3.016771638452131e-06,
       scale_pos_weight=257.4257611046447, seed=None, silent=1,
       subsample=0.1883595592814563, tree_method='approx')
In [115]:
# Predicting the Test set results
y_pred = xgb_twoclass_model.predict(X_test)
y_pred_proba = xgb_twoclass_model.predict_proba(X_test)

ac = accuracy_score(y_test, y_pred)
print('The accuracy score of the XGBoost (Two Class) model is: {}%'.format(ac * 100))
print('\n')

cr = classification_report(y_test, y_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(y_test, y_pred, title='Binomial Classification (XGBoost Confusion Matrix)',
                                    x_tick_rotation=90, 
                                    cmap='Oranges',
                                   );
print('\n')

skplt.metrics.plot_precision_recall(y_test, y_pred_proba,
                                    title='Binomial Classification (XGBoost Precision-Recall Curve)',
                                   );
print('\n')

skplt.metrics.plot_roc(y_test, y_pred_proba,
                       title='Binomial Classification (XGBoost ROC Curves)',
                      );
The accuracy score of the XGBoost (Two Class) model is: 88.19641589779987%


              precision    recall  f1-score   support

           0       0.80      0.97      0.88      9711
           1       0.97      0.82      0.89     12833

   micro avg       0.88      0.88      0.88     22544
   macro avg       0.89      0.89      0.88     22544
weighted avg       0.90      0.88      0.88     22544







In [116]:
feature_importance = pd.DataFrame({'imp': xgb_twoclass_model.feature_importances_, 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor);
plt.title('Binomial Classification (XGBoost - Feature Importance(s))', fontsize=18)
plt.show()

Binomial Classification: Using XGBoost with Semi-Supervised Classifier

Perform Classification

In [117]:
features = X_train.columns
target = 'attack_type_twoclass_num'
num_folds = 11

xgb_pseudolabeler_twoclass_model = PseudoLabeler(
    xgb_twoclass_model,
    X_test,
    features,
    target,
    sample_rate = 0.3
)

xgb_pseudolabeler_twoclass_model.fit(X_train, y_train)
#y_pred = xgb_pseudolabeler_twoclass_model.predict(X_test)
scores = cross_val_score(xgb_pseudolabeler_twoclass_model, X_train, y_train, cv=num_folds, scoring='roc_auc', n_jobs=1)
scores
[20:44:23] Tree method is selected to be 'approx'
[20:44:37] Tree method is selected to be 'approx'
[20:44:54] Tree method is selected to be 'approx'
[20:45:08] Tree method is selected to be 'approx'
[20:45:26] Tree method is selected to be 'approx'
[20:45:44] Tree method is selected to be 'approx'
[20:46:00] Tree method is selected to be 'approx'
[20:46:13] Tree method is selected to be 'approx'
[20:46:29] Tree method is selected to be 'approx'
[20:46:42] Tree method is selected to be 'approx'
[20:46:57] Tree method is selected to be 'approx'
[20:47:10] Tree method is selected to be 'approx'
[20:47:27] Tree method is selected to be 'approx'
[20:47:40] Tree method is selected to be 'approx'
[20:47:57] Tree method is selected to be 'approx'
[20:48:10] Tree method is selected to be 'approx'
[20:48:25] Tree method is selected to be 'approx'
[20:48:38] Tree method is selected to be 'approx'
[20:48:55] Tree method is selected to be 'approx'
[20:49:08] Tree method is selected to be 'approx'
[20:49:23] Tree method is selected to be 'approx'
[20:49:36] Tree method is selected to be 'approx'
[20:49:54] Tree method is selected to be 'approx'
[20:50:08] Tree method is selected to be 'approx'
Out[117]:
array([0.99999988, 0.9999856 , 0.99999911, 0.99999721, 0.99999905,
       0.99999255, 0.99999752, 0.99999948, 0.99999975, 0.99999933,
       0.99999936])
In [118]:
xgb_pseudolabeler_twoclass_model
Out[118]:
PseudoLabeler(features=Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_acc...ate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'last_flag'],
      dtype='object'),
       model=XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.7742671231359579,
       colsample_bytree=0.641156321702667, device='cpu', eval_metric='auc',
       gamma=0.027455058133773455, learning_rate=0.13207337025323213,
       max_delta_step=19, max_depth=13, min_child_weigh...ght=257.4257611046447, seed=42, silent=1,
       subsample=0.1883595592814563, tree_method='approx'),
       sample_rate=0.3, seed=42, target='attack_type_twoclass_num',
       unlabled_data=       duration  protocol_type  service  flag  src_bytes  dst_bytes  land  \
0             0              0        2     2          0          0     0
1             0              0        2     2          0          0     0
2             2              0        0     0      1298...  21
22543                  0.44                      1.00         14

[22544 rows x 36 columns])
In [119]:
y_pred = xgb_pseudolabeler_twoclass_model.predict(X_test)
y_pred_proba = xgb_pseudolabeler_twoclass_model.predict_proba(X_test)

ac = accuracy_score(y_test, y_pred)
print('The accuracy score of the XGBoost (Semi-Supervised Model) (Two Class) model is: {}%'.format(ac * 100))
print('\n')

cr = classification_report(y_test, y_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(y_test, y_pred, title='Binomial Classification (XGBoost (Semi-Supervised Model) Confusion Matrix)',
                                    x_tick_rotation=90, 
                                    cmap='Oranges',
                                   );
print('\n')

skplt.metrics.plot_precision_recall(y_test, y_pred_proba,
                                    title='Binomial Classification (XGBoost (Semi-Supervised Model) Precision-Recall Curve)',
                                   );
print('\n')

skplt.metrics.plot_roc(y_test, y_pred_proba,
                       title='Binomial Classification (XGBoost (Semi-Supervised Model) ROC Curves)',
                      );
The accuracy score of the XGBoost (Semi-Supervised Model) (Two Class) model is: 87.50443577004968%


              precision    recall  f1-score   support

           0       0.79      0.97      0.87      9711
           1       0.97      0.80      0.88     12833

   micro avg       0.88      0.88      0.88     22544
   macro avg       0.88      0.89      0.87     22544
weighted avg       0.89      0.88      0.88     22544







In [120]:
feature_importance = pd.DataFrame({'imp': xgb_pseudolabeler_twoclass_model.model.feature_importances_, 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor);
plt.title('Binomial Classification (XGBoost (Semi-Supervised Model) - Feature Importance(s))', fontsize=18)
plt.show()

Multinomial Classification: Activity is normal or DOS or PROBE or R2L or U2R

Multinomial Classification: Using LightGBM Classifier

Perform Classification

In [121]:
def lgbm_status_print_fiveclass(optimal_result):
    all_models = pd.DataFrame(lgbm_bayes_cv_tuner_fiveclass.cv_results_)
    best_params = pd.Series(lgbm_bayes_cv_tuner_fiveclass.best_params_)
    print('Model #{}\nBest ROC-AUC: {}\nBest params: {}\n'.format(len(all_models),
            np.round(lgbm_bayes_cv_tuner_fiveclass.best_score_, 4),
            lgbm_bayes_cv_tuner_fiveclass.best_params_)
         )


lgbm_bayes_cv_tuner_fiveclass = BayesSearchCV(
    estimator=lgbm.LGBMClassifier(
        n_jobs=n_cpus_avaliable,
        objective='multiclass',
        metric='multi_logloss',
        num_class=5,
        class_weight='balanced',
        silent=True,
        ),
    search_spaces={
        'boosting_type': ['gbdt', 'dart', 'rf'],
        'num_leaves': (1, 50),
        'max_depth': (1, 25),
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'n_estimators': (100, 300),
        'min_split_gain': (0.01, 1.0, 'uniform'),
        'min_child_weight': (0.01, 1.0, 'uniform'),
        'min_child_samples': (1, 50),
        'subsample': (0.01, 1.0, 'uniform'),
        'subsample_freq': (1, 50),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-5, 1000, 'log-uniform'),
        'bagging_fraction': (0.01, 1.0, 'uniform'),
        'feature_fraction': (0.01, 1.0, 'uniform'),
        },
    cv=StratifiedKFold(n_splits=9, shuffle=True),
    n_jobs=n_cpus_avaliable,
    n_iter=5,
    refit=True,
    verbose=0,
    )

lgbm_result_fiveclass = lgbm_bayes_cv_tuner_fiveclass.fit(X_train, z_train,
        callback=lgbm_status_print_fiveclass)

lgbm_fiveclass_model = lgbm_result_fiveclass.best_estimator_

print(lgbm_fiveclass_model)
Model #1
Best ROC-AUC: 0.7268
Best params: {'bagging_fraction': 0.4243946402705176, 'boosting_type': 'gbdt', 'colsample_bytree': 0.5348822432309208, 'feature_fraction': 0.37116322617197167, 'learning_rate': 0.5999551979992982, 'max_depth': 13, 'min_child_samples': 49, 'min_child_weight': 0.09418800799600857, 'min_split_gain': 0.8894396799218202, 'n_estimators': 177, 'num_leaves': 15, 'reg_lambda': 0.0004224316991956459, 'subsample': 0.06452190756392076, 'subsample_freq': 19}

Model #2
Best ROC-AUC: 0.9988
Best params: {'bagging_fraction': 0.9450203375047398, 'boosting_type': 'gbdt', 'colsample_bytree': 0.04305679468979689, 'feature_fraction': 0.17325012165161896, 'learning_rate': 0.17974925620556156, 'max_depth': 4, 'min_child_samples': 17, 'min_child_weight': 0.17322140331526656, 'min_split_gain': 0.19216941611238322, 'n_estimators': 148, 'num_leaves': 34, 'reg_lambda': 0.054818067231260946, 'subsample': 0.7518340765084419, 'subsample_freq': 25}

Model #3
Best ROC-AUC: 0.9988
Best params: {'bagging_fraction': 0.9450203375047398, 'boosting_type': 'gbdt', 'colsample_bytree': 0.04305679468979689, 'feature_fraction': 0.17325012165161896, 'learning_rate': 0.17974925620556156, 'max_depth': 4, 'min_child_samples': 17, 'min_child_weight': 0.17322140331526656, 'min_split_gain': 0.19216941611238322, 'n_estimators': 148, 'num_leaves': 34, 'reg_lambda': 0.054818067231260946, 'subsample': 0.7518340765084419, 'subsample_freq': 25}

Model #4
Best ROC-AUC: 0.9988
Best params: {'bagging_fraction': 0.9450203375047398, 'boosting_type': 'gbdt', 'colsample_bytree': 0.04305679468979689, 'feature_fraction': 0.17325012165161896, 'learning_rate': 0.17974925620556156, 'max_depth': 4, 'min_child_samples': 17, 'min_child_weight': 0.17322140331526656, 'min_split_gain': 0.19216941611238322, 'n_estimators': 148, 'num_leaves': 34, 'reg_lambda': 0.054818067231260946, 'subsample': 0.7518340765084419, 'subsample_freq': 25}

Model #5
Best ROC-AUC: 0.9988
Best params: {'bagging_fraction': 0.9450203375047398, 'boosting_type': 'gbdt', 'colsample_bytree': 0.04305679468979689, 'feature_fraction': 0.17325012165161896, 'learning_rate': 0.17974925620556156, 'max_depth': 4, 'min_child_samples': 17, 'min_child_weight': 0.17322140331526656, 'min_split_gain': 0.19216941611238322, 'n_estimators': 148, 'num_leaves': 34, 'reg_lambda': 0.054818067231260946, 'subsample': 0.7518340765084419, 'subsample_freq': 25}

LGBMClassifier(bagging_fraction=0.9450203375047398, boosting_type='gbdt',
        class_weight='balanced', colsample_bytree=0.04305679468979689,
        feature_fraction=0.17325012165161896, importance_type='split',
        learning_rate=0.17974925620556156, max_depth=4,
        metric='multi_logloss', min_child_samples=17,
        min_child_weight=0.17322140331526656,
        min_split_gain=0.19216941611238322, n_estimators=148, n_jobs=4,
        num_class=5, num_leaves=34, objective='multiclass',
        random_state=None, reg_alpha=0.0, reg_lambda=0.054818067231260946,
        silent=True, subsample=0.7518340765084419,
        subsample_for_bin=200000, subsample_freq=25)
In [122]:
# Predicting the Test set results

z_pred = lgbm_fiveclass_model.predict(X_test)
z_pred_proba = lgbm_fiveclass_model.predict_proba(X_test)

ac = accuracy_score(z_test, z_pred)
print('The accuracy score of the LightGBM (Five Class) model is: {}%'.format(ac
        * 100))
print('\n')

cr = classification_report(z_test, z_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(z_test, z_pred,
                                    title='Multinomial Classification (LightGBM Confusion Matrix)'
                                    , x_tick_rotation=90, cmap='Oranges'
                                    )
print('\n')

skplt.metrics.plot_precision_recall(z_test, z_pred_proba,
                                    title='Multinomial Classification (LightGBM Precision-Recall Curve)'
                                    )
print('\n')

skplt.metrics.plot_roc(z_test, z_pred_proba,
                       title='Multinomial Classification (LightGBM ROC Curves)'
                       )
The accuracy score of the LightGBM (Five Class) model is: 86.11603974449964%


              precision    recall  f1-score   support

           0       0.81      0.97      0.89     11235
           1       0.96      0.83      0.89      7167
           2       0.84      0.71      0.77      2421
           3       0.94      0.50      0.65      1654
           4       0.57      0.51      0.54        67

   micro avg       0.86      0.86      0.86     22544
   macro avg       0.82      0.70      0.75     22544
weighted avg       0.87      0.86      0.86     22544







Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x2682961c9e8>
In [123]:
feature_importance = \
    pd.DataFrame({'imp': lgbm_fiveclass_model.feature_importances_,
                 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'],
        ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor)
plt.title('Multinomial Classification (LightGBM - Feature Importance(s))'
          , fontsize=18)
plt.show()

Multinomial Classification: Using XGBoost Classifier

Perform Classification

In [124]:
def status_print(optimal_result):
    all_models = pd.DataFrame(bayes_cv_tuner.cv_results_)
    best_params = pd.Series(bayes_cv_tuner.best_params_)
    print('Model #{}\nBest ROC-AUC: {}\nBest params: {}\n'.format(len(all_models),
            np.round(bayes_cv_tuner.best_score_, 4),
            bayes_cv_tuner.best_params_)
         )


bayes_cv_tuner = BayesSearchCV(
    estimator=xgb.XGBClassifier(
        n_jobs=n_cpus_avaliable,
        objective='binary:logistic',
        eval_metric='auc',
        silent=1,
        tree_method='approx',
        nthread=n_cpus_avaliable,
        ),
    search_spaces={
        # 'booster': ['gbtree', 'dart'],
        'learning_rate': (0.01, 1.0, 'log-uniform'),
        'min_child_weight': (0, 10),
        'max_depth': (0, 25),
        'max_delta_step': (0, 20),
        'subsample': (0.01, 1.0, 'uniform'),
        'colsample_bytree': (0.01, 1.0, 'uniform'),
        'colsample_bylevel': (0.01, 1.0, 'uniform'),
        'reg_lambda': (1e-9, 1000, 'log-uniform'),
        'gamma': (1e-9, 0.5, 'log-uniform'),
        'min_child_weight': (0, 10),
        'n_estimators': (100, 300),
        'scale_pos_weight': (1e-6, 500, 'log-uniform'),
        },
    cv=StratifiedKFold(n_splits=9, shuffle=True),
    n_jobs=n_cpus_avaliable,
    n_iter=9,
    refit=True,
    verbose=0,
    )

result = bayes_cv_tuner.fit(X_train, z_train, callback=status_print)

xgb_fiveclass_model = result.best_estimator_

print(xgb_fiveclass_model)
Model #1
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #2
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #3
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #4
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #5
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #6
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #7
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #8
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

Model #9
Best ROC-AUC: 0.9994
Best params: {'colsample_bylevel': 0.5357859704362009, 'colsample_bytree': 0.2473515374734547, 'gamma': 3.5416747484490386e-08, 'learning_rate': 0.41290754204633556, 'max_delta_step': 9, 'max_depth': 20, 'min_child_weight': 4, 'n_estimators': 141, 'reg_lambda': 2.2331338231663726e-05, 'scale_pos_weight': 1.7723736487126073e-05, 'subsample': 0.31091737797485103}

[21:26:37] Tree method is selected to be 'approx'
XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.5357859704362009,
       colsample_bytree=0.2473515374734547, eval_metric='auc',
       gamma=3.5416747484490386e-08, learning_rate=0.41290754204633556,
       max_delta_step=9, max_depth=20, min_child_weight=4, missing=None,
       n_estimators=141, n_jobs=4, nthread=4, objective='multi:softprob',
       random_state=0, reg_alpha=0, reg_lambda=2.2331338231663726e-05,
       scale_pos_weight=1.7723736487126073e-05, seed=None, silent=1,
       subsample=0.31091737797485103, tree_method='approx')
In [125]:
# Predicting the Test set results

z_pred = xgb_fiveclass_model.predict(X_test)
z_pred_proba = xgb_fiveclass_model.predict_proba(X_test)

ac = accuracy_score(z_test, z_pred)
print('The accuracy score of the XGBoost (Five Class) model is: {}%'.format(ac
        * 100))
print('\n')

cr = classification_report(z_test, z_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(z_test, z_pred,
                                    title='Multinomial Classification (XGBoost Confusion Matrix)'
                                    , x_tick_rotation=90, cmap='Oranges'
                                    )
print('\n')

skplt.metrics.plot_precision_recall(z_test, z_pred_proba,
                                    title='Multinomial Classification (XGBoost Precision-Recall Curve)'
                                    )
print('\n')

skplt.metrics.plot_roc(z_test, z_pred_proba,
                       title='Multinomial Classification (XGBoost ROC Curves)'
                       )
The accuracy score of the XGBoost (Five Class) model is: 83.48562810503904%


              precision    recall  f1-score   support

           0       0.77      0.98      0.86     11235
           1       0.97      0.82      0.89      7167
           2       0.88      0.71      0.79      2421
           3       0.92      0.13      0.23      1654
           4       0.80      0.36      0.49        67

   micro avg       0.83      0.83      0.83     22544
   macro avg       0.87      0.60      0.65     22544
weighted avg       0.86      0.83      0.81     22544







Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x2682a955208>
In [126]:
feature_importance = pd.DataFrame({'imp': xgb_fiveclass_model.feature_importances_, 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor);
plt.title('Multinomial Classification (XGBoost - Feature Importance(s))', fontsize=18)
plt.show()

Multinomial Classification: Using XGBoost with Semi-Supervised Classifier

Perform Classification

In [127]:
features = X_train.columns
target = 'attack_type_fiveclass_num'
num_folds = 11

xgb_pseudolabeler_fiveclass_model = PseudoLabeler(
    xgb_fiveclass_model,
    X_test,
    features,
    target,
    sample_rate = 0.3
)

xgb_pseudolabeler_fiveclass_model.fit(X_train, z_train,)
#z_pred = xgb_pseudolabeler_fiveclass_model.predict(X_test)
scores = cross_val_score(xgb_pseudolabeler_fiveclass_model, X_train, z_train, cv=num_folds, n_jobs=1)
scores
[21:26:57] Tree method is selected to be 'approx'
[21:27:13] Tree method is selected to be 'approx'
[21:27:31] Tree method is selected to be 'approx'
[21:27:47] Tree method is selected to be 'approx'
[21:28:03] Tree method is selected to be 'approx'
[21:28:18] Tree method is selected to be 'approx'
[21:28:35] Tree method is selected to be 'approx'
[21:28:50] Tree method is selected to be 'approx'
[21:29:06] Tree method is selected to be 'approx'
[21:29:20] Tree method is selected to be 'approx'
[21:29:37] Tree method is selected to be 'approx'
[21:29:53] Tree method is selected to be 'approx'
[21:30:09] Tree method is selected to be 'approx'
[21:30:24] Tree method is selected to be 'approx'
[21:30:41] Tree method is selected to be 'approx'
[21:30:55] Tree method is selected to be 'approx'
[21:31:11] Tree method is selected to be 'approx'
[21:31:25] Tree method is selected to be 'approx'
[21:31:42] Tree method is selected to be 'approx'
[21:31:59] Tree method is selected to be 'approx'
[21:32:16] Tree method is selected to be 'approx'
[21:32:30] Tree method is selected to be 'approx'
[21:32:47] Tree method is selected to be 'approx'
[21:33:01] Tree method is selected to be 'approx'
Out[127]:
array([0.99973808, 0.99895233, 0.99947612, 0.99956343, 0.99956343,
       0.99903955, 0.99903955, 0.99912671, 0.99938865, 0.99947598,
       0.99947594])
In [128]:
xgb_pseudolabeler_fiveclass_model
Out[128]:
PseudoLabeler(features=Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_acc...ate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'last_flag'],
      dtype='object'),
       model=XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.5357859704362009,
       colsample_bytree=0.2473515374734547, eval_metric='auc',
       gamma=3.5416747484490386e-08, learning_rate=0.41290754204633556,
       max_delta_step=9, max_depth=20, min_child_weight=4, missing...7723736487126073e-05, seed=42, silent=1,
       subsample=0.31091737797485103, tree_method='approx'),
       sample_rate=0.3, seed=42, target='attack_type_fiveclass_num',
       unlabled_data=       duration  protocol_type  service  flag  src_bytes  dst_bytes  land  \
0             0              0        2     2          0          0     0
1             0              0        2     2          0          0     0
2             2              0        0     0      1298...  21
22543                  0.44                      1.00         14

[22544 rows x 36 columns])
In [129]:
z_pred = xgb_pseudolabeler_fiveclass_model.predict(X_test)
z_pred_proba = xgb_pseudolabeler_fiveclass_model.predict_proba(X_test)

ac = accuracy_score(z_test, z_pred)
print('The accuracy score of the XGBoost (Semi-Supervised Model) (Five Class) model is: {}%'.format(ac * 100))
print('\n')

cr = classification_report(z_test, z_pred)
print(cr)
print('\n')

skplt.metrics.plot_confusion_matrix(z_test, z_pred, title='Multinomial Classification (XGBoost (Semi-Supervised Model) Confusion Matrix)',
                                    x_tick_rotation=90, 
                                    cmap='Oranges',
                                   );
print('\n')

skplt.metrics.plot_precision_recall(z_test, z_pred_proba,
                                    title='Multinomial Classification (XGBoost (Semi-Supervised Model) Precision-Recall Curve)',
                                   );
print('\n')

skplt.metrics.plot_roc(z_test, z_pred_proba,
                       title='Multinomial Classification (XGBoost (Semi-Supervised Model) ROC Curves)',
                      );
The accuracy score of the XGBoost (Semi-Supervised Model) (Five Class) model is: 83.05535841022001%


              precision    recall  f1-score   support

           0       0.77      0.97      0.86     11235
           1       0.97      0.83      0.89      7167
           2       0.83      0.64      0.72      2421
           3       0.87      0.15      0.26      1654
           4       0.81      0.37      0.51        67

   micro avg       0.83      0.83      0.83     22544
   macro avg       0.85      0.59      0.65     22544
weighted avg       0.85      0.83      0.81     22544







In [130]:
feature_importance = pd.DataFrame({'imp': xgb_pseudolabeler_fiveclass_model.model.feature_importances_, 'col': X_train.columns})
feature_importance = feature_importance.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
feature_importance.plot(kind='barh', x='col', y='imp', color=belize_light_flavor);
plt.title('Multinomial Classification (XGBoost (Semi-Supervised Model) - Feature Importance(s))', fontsize=18)
plt.show()
In [131]:
print("Notebook Runtime: %0.2f Minutes"%((time.time() - notebookstart)/60))
Notebook Runtime: 90.68 Minutes